public inbox for linux-edac@vger.kernel.org
 help / color / mirror / Atom feed
From: Yazen Ghannam <yazen.ghannam@amd.com>
To: Muralidhara M K <muralimk@amd.com>,
	linux-edac@vger.kernel.org, x86@kernel.org
Cc: yazen.ghannam@amd.com, linux-kernel@vger.kernel.org,
	bp@alien8.de, mchehab@kernel.org,
	Muralidhara M K <muralidhara.mk@amd.com>
Subject: Re: [PATCH 7/7] EDAC/amd64: RAS: platform/x86/amd: Identify all physical pages in row
Date: Thu, 26 Oct 2023 10:31:04 -0400	[thread overview]
Message-ID: <f9e2d199-1813-4b4a-83fd-bf93919a3411@amd.com> (raw)
In-Reply-To: <20231025073339.630093-8-muralimk@amd.com>

On 10/25/2023 3:33 AM, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>


The $SUBJECT needs to be updated.

> 
> AMD systems have HBM memory embedded with the chips, The entire memory
> is managed by host OS. Error containment needs to be reliable, because
> HBM memory cannot be replaced.
> 
> Persist all UMC DRAM ECC errors, the OS can make the bad or poisoned page
> state persistent so that it will not use the memory upon the next boot.
> 

There is nothing in this patch regarding persistence. It finds all 
system physical addresses covering a DRAM row and request their pages to 
be retired.

> The reported MCA error address in HBM in the format PC/SID/Bank/ROW/COL
> For example, In MI300A C1/C0 (column bits 1-0) is at SPA bit 6-5. Assuming
> PFN only looks at SPA bit 12 or higher, column bits 1-0 could be skipped.
> For PFN, SPA bits higher or equal than 12 matters. So column bits c2, c3
> and c4 gives 8 possible combination of addresses in a row.
> 
> So, Identify all physical pages in a HBM row and retire all the pages
> to get rid of intermittent or recurrent memory errors.
> 

There are a number of grammatical errors in the commit message. Please 
fix them.

> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
> ---
>   drivers/edac/amd64_edac.c |   5 ++
>   drivers/ras/amd/atl/umc.c | 103 ++++++++++++++++++++++++++++++++++++++
>   include/linux/amd-atl.h   |   2 +
>   3 files changed, 110 insertions(+)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 79c6c552ee14..d0db11e19a46 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2838,6 +2838,11 @@ static void decode_umc_error(int node_id, struct mce *m)
>   
>   	error_address_to_page_and_offset(sys_addr, &err);
>   
> +	if (pvt->fam == 0x19 && (pvt->model >= 0x90 && pvt->model <= 0x9f)) {
> +		if (identify_poison_pages_retire_row(m))
> +			return;

EDAC can still log the original error. So why return early here?

> +	}
> +
>   log_error:
>   	__log_ecc_error(mci, &err, ecc_type);
>   }
> diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
> index 52247a7949fb..d31ad7680ff1 100644
> --- a/drivers/ras/amd/atl/umc.c
> +++ b/drivers/ras/amd/atl/umc.c
> @@ -255,3 +255,106 @@ int umc_mca_addr_to_sys_addr(struct mce *m, u64 *sys_addr)
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(umc_mca_addr_to_sys_addr);
> +
> +/*
> + * High Bandwidth Memory (HBM v3) has fixed number of columns in a
> + * row (8 columns in one HBM row).
> + * Extract column bits to find all the combination of masks to retire
> + * all the poison pages in a row.
> + */
> +#define MAX_COLUMNS_IN_HBM_ROW	8

Is this true in general? Or is it specific to a product?

> +
> +/* The C2 bit in CH NA address */
> +#define UMC_NA_C2_BIT	BIT(8)
> +/* The C3 bit in CH NA address */
> +#define UMC_NA_C3_BIT	BIT(9)
> +/* The C4 bit in CH NA address */
> +#define UMC_NA_C4_BIT	BIT(14)
> +

C2/C3/C4 is unclear in the comments. Please expand these to be more 
explicit, like Column 2 bit, etc.

> +/* masks to get all possible combinations of column addresses */
> +#define C_1_1_1_MASK	(UMC_NA_C4_BIT | UMC_NA_C3_BIT | UMC_NA_C2_BIT)
> +#define C_1_1_0_MASK	(UMC_NA_C4_BIT | UMC_NA_C3_BIT)
> +#define C_1_0_1_MASK	(UMC_NA_C4_BIT | UMC_NA_C2_BIT)
> +#define C_1_0_0_MASK	(UMC_NA_C4_BIT)
> +#define C_0_1_1_MASK	(UMC_NA_C3_BIT | UMC_NA_C2_BIT)
> +#define C_0_1_0_MASK	(UMC_NA_C3_BIT)
> +#define C_0_0_1_MASK	(UMC_NA_C2_BIT)
> +#define C_0_0_0_MASK	~C_1_1_1_MASK
> +
> +/* Identify all combination of column address physical pages in a row */

This comment is not clear to me.

> +static int amd_umc_identify_pages_in_row(struct mce *m, u64 *spa_addr)

Also, this function does not identify pages. It identifies system 
physical addresses.

Additionally, this function seems very much specific to this 
implementation of HBM3. So the function name should indicate this.

> +{
> +	u8 cs_inst_id = get_cs_inst_id(m);
> +	u8 socket_id = get_socket_id(m);
> +	u64 norm_addr = get_norm_addr(m);
> +	u8 die_id = get_die_id(m);
> +	u16 df_acc_id = get_df_acc_id(m);
> +
> +	u64 retire_addr, column;
> +	u64 column_masks[] = { 0, C_0_0_1_MASK, C_0_1_0_MASK, C_0_1_1_MASK,
> +			C_1_0_0_MASK, C_1_0_1_MASK, C_1_1_0_MASK, C_1_1_1_MASK };
> +
> +	/* clear and loop for all possibilities of [c4 c3 c2] */
> +	norm_addr &= C_0_0_0_MASK;
> +
> +	for (column = 0; column < ARRAY_SIZE(column_masks); column++) {
> +		retire_addr = norm_addr | column_masks[column];
> +
> +		if (norm_to_sys_addr(df_acc_id, socket_id, die_id, cs_inst_id, &retire_addr))
> +			return -EINVAL;

Why return if a single translation fails? What if the other seven can 
succeed? Wouldn't it be better to find and offline 7/8 possible bad 
addresses than 0/8?

> +		*(spa_addr + column) = retire_addr;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Find any duplicate addresses in all combination of column address */
> +static void amd_umc_find_duplicate_spa(u64 arr[], int *size)
> +{
> +	int i, j, k;
> +
> +	/* use nested for loop to find the duplicate elements in array */
> +	for (i = 0; i < *size; i++) {
> +		for (j = i + 1; j < *size; j++) {
> +			/* check duplicate element */
> +			if (arr[i] == arr[j]) {
> +				/* delete the current position of the duplicate element */
> +				for (k = j; k < (*size - 1); k++)
> +					arr[k] = arr[k + 1];
> +
> +			/* decrease the size of array after removing duplicate element */
> +				(*size)--;
> +
> +			/* if the position of the elements is changes, don't increase index j */
> +				j--;
> +			}
> +		}
> +	}
> +}

Is it really necessary to de-duplicate this array?

This data is discarded after the page retirement step. So checking for 
duplicates can be done there.

> +
> +int identify_poison_pages_retire_row(struct mce *m)
> +{
> +	int i, ret, addr_range;
> +	unsigned long pfn;
> +	u64 col[MAX_COLUMNS_IN_HBM_ROW];
> +	u64 *spa_addr = col;
> +

Just use "col[]"; *spa_addr is not needed.

Also, please order variable declarations from longest to shortest by 
line length.

> +	/* Identify all pages in a row */

Comment is not needed.

> +	pr_info("Identify all physical Pages in a row for MCE addr:0x%llx\n", m->addr);
> +	ret = amd_umc_identify_pages_in_row(m, spa_addr);
> +	if (!ret) {

If this succeeds, then you print info. And if it fails, then you don't 
print, but continue to process information. This doesn't seem correct.

> +		for (i = 0; i < MAX_COLUMNS_IN_HBM_ROW; i++)
> +			pr_info("col[%d]_addr:0x%llx ", i, spa_addr[i]);
> +	}
> +	/* Find duplicate entries from all 8 physical addresses in a row */
> +	addr_range = ARRAY_SIZE(col);

You already know the array size; you defined it above.

> +	amd_umc_find_duplicate_spa(spa_addr, &addr_range);
> +	/* do page retirement on all system physical addresses */
> +	for (i = 0; i < addr_range; i++) {

You can check for duplicates here. If a value is a duplicate, then 
continue the loop.

> +		pfn = PHYS_PFN(spa_addr[i]);
> +		memory_failure(pfn, 0);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(identify_poison_pages_retire_row);

A namespace prefix is needed for exported functions, i.e. amd_atl_* or 
similar.

Thanks,
Yazen

  reply	other threads:[~2023-10-26 14:31 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-25  7:33 [PATCH 0/7] Address Translation support for MI200 and MI300 models Muralidhara M K
2023-10-25  7:33 ` [PATCH 1/7] RAS: Add Address Translation support for MI200 Muralidhara M K
2023-10-26  8:44   ` kernel test robot
2023-10-25  7:33 ` [PATCH 2/7] RAS: Add Address Translation support for MI300 Muralidhara M K
2023-10-25  7:33 ` [PATCH 3/7] RAS: Add MCA Error address conversion for UMC Muralidhara M K
2023-10-25  7:33 ` [PATCH 4/7] RAS: Add static lookup table to get CS physical ID Muralidhara M K
2023-10-25  7:33 ` [PATCH 5/7] RAS: Add fixed Physical to logical CS ID mapping table Muralidhara M K
2023-10-25  7:33 ` [PATCH 6/7] RAS: Get CS fabirc ID register bit fields Muralidhara M K
2023-10-26 13:55   ` Yazen Ghannam
2023-10-25  7:33 ` [PATCH 7/7] EDAC/amd64: RAS: platform/x86/amd: Identify all physical pages in row Muralidhara M K
2023-10-26 14:31   ` Yazen Ghannam [this message]
2023-10-26 13:21 ` [PATCH 0/7] Address Translation support for MI200 and MI300 models Yazen Ghannam

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f9e2d199-1813-4b4a-83fd-bf93919a3411@amd.com \
    --to=yazen.ghannam@amd.com \
    --cc=bp@alien8.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=muralidhara.mk@amd.com \
    --cc=muralimk@amd.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox