Re: [PATCH v4] EDAC: Add ARM64 EDAC

From: Brijesh Singh <brijeshkumar.singh@amd.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: brijeshkumar.singh@amd.com, linux-arm-kernel@lists.infradead.org,
	robh+dt@kernel.org, pawel.moll@arm.com,
	ijc+devicetree@hellion.org.uk, galak@codeaurora.org,
	dougthompson@xmission.com, bp@alien8.de, mchehab@osg.samsung.com,
	devicetree@vger.kernel.org, guohanjun@huawei.com,
	andre.przywara@arm.com, arnd@arndb.de, sboyd@codeaurora.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org
Subject: Re: [PATCH v4] EDAC: Add ARM64 EDAC
Date: Fri, 30 Oct 2015 11:26:58 -0500
Message-ID: <56339A52.8060701@amd.com>
In-Reply-To: <20151028164053.GG25451@leverpostej>

Hi Mark,

>> +
>> +Required properties:
>> +- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,cortex-a57-edac";
>> +	};
>> +
> 
> This is insufficient for big.LITTLE, no interrupt is possible, and we
> haven't defined the rules for accessing the registers (e.g. whether
> write backs are permitted).
> 
> Please see my prior comments [1] on those points.
> 
> If we're going to use this feature directly within the kernel, we need
> to consider the envelope of possible implementations rather than your
> use-case alone.
>

I have looked at the possibility of pushing correctable error logging into
firmware, but given the current hardware limitations the OS seems like the best
place to implement it. Let me summarize the issues we are running into:

* Correctable errors do not generate any interrupt:
  If error parsing has to be implemented inside firmware, the work needs
  to be split between the OS and firmware. The OS could execute an SMC
  instruction to call into firmware, and firmware could then check the error
  syndrome registers and build a HEST table if it finds a correctable error.
  This method introduces a performance cost, because the OS has to execute an
  SMC every 100ms or so just to poll for correctable errors. If you have any
  other recommendation, please share it.

> * Interaction with firmware
> - When/do we handle interrupts?

We can add properties to the DT binding:

1) "num-interrupts = 1": the number of interrupts, one interrupt per cluster,
    e.g. if you have 4 clusters then num-interrupts = 4.
2) interrupts = <0 92 0>, <0 94 0>, <0 96 0>, <0 98 0>;  // interrupt mapping

If num-interrupts = 0, then firmware handles the interrupts. Optionally we can use the
HEST FIRMWARE-FIRST bit: if the bit is set, firmware is handling the interrupt;
otherwise use the DT information.
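
To make the proposed ownership rule concrete, here is a minimal C sketch of the
decision only. The names edac_irq_owner, hest_firmware_first and the enum are
hypothetical and exist just for illustration; how the OS would actually read
the HEST flag or the DT property is not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Who services the error interrupts (hypothetical names, illustration only). */
enum irq_owner { IRQ_OWNER_FIRMWARE, IRQ_OWNER_OS };

/*
 * Proposed rule: the HEST FIRMWARE-FIRST bit takes precedence; otherwise
 * fall back to the DT, where num-interrupts = 0 means firmware owns the
 * interrupts and a non-zero count means the OS requests them.
 */
static enum irq_owner edac_irq_owner(bool hest_firmware_first,
                                     int num_interrupts)
{
    assert(num_interrupts >= 0);

    if (hest_firmware_first)
        return IRQ_OWNER_FIRMWARE;
    if (num_interrupts == 0)
        return IRQ_OWNER_FIRMWARE;
    return IRQ_OWNER_OS;
}
```

With this rule, a DT that lists four cluster interrupts and no FIRMWARE-FIRST
flag gives the interrupts to the OS; every other combination leaves them with
firmware.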

>
>  - When is it valid to write back and clear an error? We should not do
>    this behind the back of any firmware that owns the interface.

As far as the A57 TRM is concerned, you are right: for both correctable and
uncorrectable errors the VALID bit in the L1/L2 syndrome registers needs to be
cleared. So yes, we need to define a rule for accessing the registers. I can
think of two possible approaches here:

1) Add an "error-syndrome-reg-write-access = 1" property to the DT.
   * if '1', the OS has exclusive write-back access to the error syndrome registers
   * if '0', the OS will not clear the VALID bit on a fatal error

  The handler looks like this:

  parse_error_syndrome () {
   val = read_cpumerrsr

   if (!IS_VALID(val))
     return

   /* log the error details */

   /* if fatal error and OS does not have exclusive write back access */
   if (IS_FATAL(val) && !error_syndrome_reg_write_access)
     return

   val &= ~(1UL << 31);  /* clear valid bit, keep the other fields */
   write_cpumerrsr(val)  /* write back (accessor name assumed) */
  }

2) Use the HEST FIRMWARE-FIRST bit field: if the bit is set, the OS should not
   clear the VALID bit on a fatal error; if the bit is clear, the OS clears the
   VALID bit.

Since firmware will never handle correctable errors, it is always safe to clear
the VALID bit on a non-fatal error. If you have any other suggestions, please share them.
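
As a sanity check of the write-back rule, here is a small userspace C sketch
that models the syndrome register as a plain variable. Everything in it is an
assumption made for illustration: struct fake_syndrome_reg stands in for the
real system register, os_may_write_back stands in for either the DT property
or the inverted FIRMWARE-FIRST bit, and the FATAL bit position (bit 63) is my
reading of the TRM rather than something defined in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* VALID is bit 31 as in the handler above; FATAL at bit 63 is an assumption. */
#define ERR_VALID (UINT64_C(1) << 31)
#define ERR_FATAL (UINT64_C(1) << 63)

/* Stand-in for the hardware syndrome register (hypothetical). */
struct fake_syndrome_reg {
    uint64_t val;
};

/*
 * Returns true if a valid error record was found (and would be logged).
 * The VALID bit is cleared unless the error is fatal and the OS has not
 * been granted write-back access, in which case the record is left
 * untouched for firmware.
 */
static bool parse_error_syndrome(struct fake_syndrome_reg *reg,
                                 bool os_may_write_back)
{
    uint64_t val;

    assert(reg != NULL);
    val = reg->val;

    if (!(val & ERR_VALID))
        return false;

    /* A real driver would log the error details here. */

    if ((val & ERR_FATAL) && !os_may_write_back)
        return true;

    reg->val = val & ~ERR_VALID; /* clear VALID, keep the other fields */
    return true;
}
```

The key property is that a non-fatal error always has its VALID bit cleared,
while a fatal error is only cleared when the OS owns the register.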

I am not pushing only my use-case; I am trying to work within the current hardware
limitations and still support all the possibilities. I am open to hearing your suggestions.
I am also not well versed in big.LITTLE CPUs, so you may need to point me in the right
direction as we progress. My testing is limited to the Cortex-A57.

> 
> I don't think the use of old_mask is sufficient here, given the mapping
> of logical to physical ID is arbitrary. For example, we could have CPUs
> 0,5,6,7 in one cluster, and CPUs 1,2,3,4 in another, and in that case
> we'd check the first cluster twice.
> 

Noted. I should use the physical ID instead of the logical mapping.
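
A rough userspace sketch of what I have in mind: group logical CPUs by the
cluster field of their physical ID and pick one CPU per cluster, so that each
cluster is checked exactly once. The helper names are hypothetical, the IDs
are passed in as an array instead of being read from the hardware, and
treating MPIDR Aff1 (bits [15:8]) as the cluster ID is an assumption that
holds for the A57 parts I have tested but may not hold everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Aff1 field of an MPIDR-style ID; used as the cluster ID (assumption). */
static unsigned int cluster_of(uint64_t mpidr)
{
    return (unsigned int)((mpidr >> 8) & 0xff);
}

/*
 * mpidr[i] is the physical ID of logical CPU i. Stores in reps[] the first
 * logical CPU seen for each distinct cluster and returns how many were
 * stored (at most max_reps).
 */
static size_t pick_cluster_reps(const uint64_t *mpidr, size_t ncpus,
                                size_t *reps, size_t max_reps)
{
    size_t nreps = 0;

    assert(mpidr != NULL && reps != NULL);

    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        unsigned int cl = cluster_of(mpidr[cpu]);
        int seen = 0;

        for (size_t i = 0; i < nreps; i++) {
            if (cluster_of(mpidr[reps[i]]) == cl) {
                seen = 1;
                break;
            }
        }
        if (!seen && nreps < max_reps)
            reps[nreps++] = cpu;
    }
    return nreps;
}
```

For your example, with logical CPUs 0,5,6,7 in one cluster and 1,2,3,4 in the
other, this selects CPU 0 and CPU 1, so neither cluster is visited twice.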

> This also is wrong for big.LITTLE; we can't necessarily check on every
> CPU.
>
