EDAC, ghes: use CPER module handles to locate DIMMs

From: James Morse <james.morse@arm.com>
To: Fan Wu <wufan@codeaurora.org>
Cc: mchehab@kernel.org, bp@alien8.de, baicar.tyler@gmail.com,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH] EDAC, ghes: use CPER module handles to locate DIMMs
Date: Thu, 30 Aug 2018 11:48:16 +0100
Message-ID: <c70fa799-bc9f-64c8-e151-088ca8c3ed3e@arm.com>
In-Reply-To: <1535567632-18089-1-git-send-email-wufan@codeaurora.org>

Hi Fan,

On 29/08/18 19:33, Fan Wu wrote:
> The current ghes_edac driver does not update per-dimm error
> counters when reporting memory errors, because there is no
> platform-independent way to find DIMMs based on the error
> information provided by firmware.

I'd argue there is: it's in the CPER records, we just didn't do anything useful
with the information in the past!


> This patch offers a solution
> for platforms whose firmwares provide valid module handles
> (SMBIOS type 17) in error records. In this case ghes_edac will
> use the module handles to locate DIMMs and thus makes per-dimm
> error reporting possible.


> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 473aeec..db527f0 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -81,6 +81,26 @@ static void ghes_edac_count_dimms(const struct dmi_header *dh, void *arg)
>  		(*num_dimm)++;
>  }
>  
> +static int ghes_edac_dimm_index(u16 handle)
> +{
> +	struct mem_ctl_info *mci;
> +	int i;
> +
> +	if (!ghes_pvt)
> +		return -1;

ghes_edac_report_mem_error() already checked this. As it's the only caller,
there is no need to check it again.


> +	mci = ghes_pvt->mci;
> +
> +	if (!mci)
> +		return -1;

Can this happen? ghes_edac_report_mem_error() would have dereferenced this already!

If you need the struct mem_ctl_info, you may as well pass it in as the only
caller has it to hand.
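
To make the suggestion concrete, here is a minimal userspace sketch of the
lookup with the mci passed in by the caller. The two structs are cut-down
stand-ins for the kernel ones (only the fields the loop touches), uint16_t
stands in for u16, and the assert() merely documents the caller's guarantee.
Treat it as an illustration under those assumptions, not as the final patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel structures: only the fields used below. */
struct dimm_info {
	uint16_t smbios_handle;		/* field added by the patch under review */
};

struct mem_ctl_info {
	unsigned int tot_dimms;
	struct dimm_info **dimms;
};

/*
 * Return the index of the DIMM whose SMBIOS handle matches, or -1 if none
 * does. The caller hands over the mci it already holds, so there is no
 * global state to NULL-check here.
 */
static int ghes_edac_dimm_index(const struct mem_ctl_info *mci, uint16_t handle)
{
	unsigned int i;

	assert(mci != NULL);	/* caller's precondition, not a runtime check */

	for (i = 0; i < mci->tot_dimms; i++) {
		if (mci->dimms[i]->smbios_handle == handle)
			return (int)i;
	}
	return -1;
}
```

With this shape, the call site in ghes_edac_report_mem_error() would pass the
mci it has already dereferenced along with mem_err->mem_dev_handle.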


> +
> +	for (i = 0; i < mci->tot_dimms; i++) {
> +		if (mci->dimms[i]->smbios_handle == handle)
> +			return i;
> +	}
> +	return -1;
> +}
> +
>  static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
>  {
>  	struct ghes_edac_dimm_fill *dimm_fill = arg;
> @@ -177,6 +197,8 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
>  				entry->total_width, entry->data_width);
>  		}
>  
> +		dimm->smbios_handle = entry->handle;

We aren't checking for duplicate handles (e.g. they're all zero). I think this
is fine, as chances are firmware on those systems won't set
CPER_MEM_VALID_MODULE_HANDLE. If it does, the handle it gives us is ambiguous,
and we pick the first matching DIMM instead of whining about broken firmware
tables.

(I'm just drawing attention to it in case someone disagrees)
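
If someone does want the check, it could be as small as the sketch below. It
is a standalone userspace illustration: handles_have_duplicates() is a
hypothetical helper name, and the plain array of handles is an assumption
made for the example, not how the driver stores them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if any two entries in handles[0..n) are equal.
 * Quadratic, which is acceptable for the handful of DIMMs in one system.
 */
static bool handles_have_duplicates(const uint16_t *handles, size_t n)
{
	size_t i, j;

	assert(handles != NULL || n == 0);

	for (i = 0; i < n; i++) {
		for (j = i + 1; j < n; j++) {
			if (handles[i] == handles[j])
				return true;
		}
	}
	return false;
}
```

A table where every handle is zero would be flagged, while a table of distinct
handles would pass.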


>  		dimm_fill->count++;
>  	}
>  }
> @@ -327,12 +349,20 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  		p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
>  	if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
>  		const char *bank = NULL, *device = NULL;
> +		int index = -1;
> +
>  		dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device);

> +		p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
> +			     mem_err->mem_dev_handle);
>  		if (bank != NULL && device != NULL)
>  			p += sprintf(p, "DIMM location:%s %s ", bank, device);
> -		else
> -			p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
> -				     mem_err->mem_dev_handle);

Why do we now print the handle every time? The handle is pretty meaningless: it
can only be used to find the location strings, and if we get those we print them
instead.
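
For reference, the existing behaviour amounts to the following, shown as a
userspace sketch. format_dimm_id() is a hypothetical helper invented for this
example, and snprintf() into a caller-supplied buffer stands in for the
driver's sprintf() into e->location.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Describe a DIMM: prefer the human-readable location strings, and fall
 * back to the raw DMI handle only when they are unavailable.
 * Returns what snprintf() returns.
 */
static int format_dimm_id(char *buf, size_t size, const char *bank,
			  const char *device, uint16_t handle)
{
	assert(buf != NULL && size > 0);

	if (bank != NULL && device != NULL)
		return snprintf(buf, size, "DIMM location:%s %s ", bank, device);

	return snprintf(buf, size, "DIMM DMI handle: 0x%.4x ",
			(unsigned int)handle);
}
```

The handle only reaches the log when the lookup for bank and device strings
came back empty.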


> +		index = ghes_edac_dimm_index(mem_err->mem_dev_handle);
> +		if (index >= 0) {
> +			e->top_layer = index;
> +			e->enable_per_layer_report = true;
> +		}
> +
>  	}
>  	if (p > e->location)
>  		*(p - 1) = '\0';

Looks good to me!


Thanks,

James


