From: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
To: "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"leoyang.li@nxp.com" <leoyang.li@nxp.com>,
	"york.sun@nxp.com" <york.sun@nxp.com>
Subject: Re: Machine Check in P2010(e500v2)
Date: Fri, 8 Sep 2017 09:54:25 +0000
Message-ID: <1504864463.31322.31.camel@infinera.com>
In-Reply-To: <AM4PR0401MB16993C0235EA67562A8B8D4E8F940@AM4PR0401MB1699.eurprd04.prod.outlook.com>

On Thu, 2017-09-07 at 18:54 +0000, Leo Li wrote:
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@infinera.com]
> > Sent: Thursday, September 07, 2017 3:41 AM
> > To: linuxppc-dev@lists.ozlabs.org; Leo Li <leoyang.li@nxp.com>; York Sun
> > <york.sun@nxp.com>
> > Subject: Re: Machine Check in P2010(e500v2)
> >
> > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> > > On Wed, 2017-09-06 at 21:13 +0000, Leo Li wrote:
> > > > > -----Original Message-----
> > > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@infinera.com]
> > > > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li <leoyang.li@nxp.com>;
> > > > > York Sun <york.sun@nxp.com>
> > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > >=20
> > > > > On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@infinera.com]
> > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li
> > > > > > > <leoyang.li@nxp.com>; York Sun <york.sun@nxp.com>
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > >=20
> > > > > > > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: York Sun
> > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > > > To: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>;
> > > > > > > > > linuxppc- dev@lists.ozlabs.org; Leo Li
> > > > > > > > > <leoyang.li@nxp.com>
> > > > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > > >=20
> > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > > >=20
> > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > > > > So after some debugging I found this bug:
> > > > > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > > > > >          if (is_in_pci_mem_space(addr)) {
> > > > > > > > > > >                  if (user_mode(regs)) {
> > > > > > > > > > >                          pagefault_disable();
> > > > > > > > > > > -                       ret = get_user(regs->nip, &inst);
> > > > > > > > > > > +                       ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > > > > >                          pagefault_enable();
> > > > > > > > > > >                  } else {
> > > > > > > > > > >                          ret = probe_kernel_address(regs->nip, inst);
> > > > > > > > > > >
> > > > > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > > > > Now I wonder why this fixup is there in the first place?
> > > > > > > > > > > The routine will not really fixup the insn, just return
> > > > > > > > > > > 0xffffffff for the failing read and then advance the process NIP.
> > > > > > > >=20
> > > > > > > > You are right.  The code here only gives 0xffffffff to the
> > > > > > > > load instructions and
> > > > > > >=20
> > > > > > > continue with the next instruction when the load instruction
> > > > > > > is causing the machine check.  This will prevent a system
> > > > > > > lockup when reading from PCI/RapidIO device which is link dow=
n.
> > > > > > > >=20
> > > > > > > > I don't know what is actual problem in your case.  Maybe it
> > > > > > > > is a write
> > > > > > >=20
> > > > > > > instruction instead of read?   Or the code is in a infinite l=
oop waiting for
> >=20
> > a
> > > > >=20
> > > > > valid
> > > > > > > read result?  Are you able to do some further debugging with
> > > > > > > the NIP correctly printed?
> > > > > > > >=20
> > > > > > >=20
> > > > > > > > According to the MC it is a Read and the NIP also leads to a
> > > > > > > > read in the program.
> > > > > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > > > > Question, is it safe to add a small printk when this MC
> > > > > > > > happens (after fixing up)? I need to see that it has happened
> > > > > > > > as the error is somewhat random.
> > > > > > >
> > > > > > > I think it is safe to add printk as the current machine check
> > > > > > > handlers are also using printk.
> > > > > >
> > > > > > I hope so, but if the fixup fires there is no printk at all so I
> > > > > > was a bit unsure.
> > > > > > Don't like this fixup though, is there not a better way than
> > > > > > faking a read to user space(or kernel for that matter) ?
> > > > >
> > > > > I don't have a better idea.  Without the fixup, the offending load
> > > > > instruction will never finish if there is anything wrong with the
> > > > > backing device and freeze the whole system.  Do you have any
> > > > > suggestion in mind?
> > > > >
> > > >
> > > > But it never finishes the load, it just fakes a load of 0xffffffff,
> > > > for user space I would rather have it signal a SIGBUS but that does
> > > > not seem to work either, at least not for us, but that could be a bug
> > > > in general MC code maybe.
> > > > This fixup might be valid for kernel only as it has never worked for
> > > > user space due to the bug I found.
> > > >
> > > > Where can I read about this errata ?
> >
> > I have looked high and low and cannot find an erratum which maps to this
> > fixup. The closest I get is A-005125, which seems to have another
> > workaround. I cannot find any evidence that this workaround has been
> > applied in Linux, can you?
>
> This is not A-005125.  There was an erratum for this issue with older
> silicons (e.g. erratum PCI-ex 3 for MPC8572).
> " When its link goes down, the PCI Express controller clears all
> outstanding transactions with an error indicator and sends a link down
> exception to the interrupt controller if PEX_PME_MES_DISR[LDDD] = 0.
> If, however, any transactions are sent to the controller after the link
> down event, they are accepted by the controller and wait for the link to
> come back up before starting any timeout counters (for example,
> completion timeout). There is no mechanism to cancel the new
> transactions short of a device HRESET. "
>
> But it was removed in newer silicon like P2020/P2010 probably because a
> Machine Check will be triggered in this situation to deal with the
> stalled instruction and no longer considered it as a hardware issue.
>

Maybe this fixup should be configurable then?

> The A-005125 is dealt with in u-boot.   https://lists.denx.de/pipermail/u-boot/2013-August/161185.html

Yes, I found it eventually :)

However, I cannot return to normal execution. I can follow the code to
returning from machine_check_exception() and moving into the ASM handler for
returning from a ME, but then I am a bit lost. There does not seem to be any
problem executing; it feels more like a SW bug in dealing with machine checks.
I don't know how to diagnose this further and could use some pointers.
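
For reference, the fixup discussed above (fake an all-ones result for the
failing load, then advance NIP past it) can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not the kernel code:
struct fake_regs, insn_opcode(), insn_rt() and fake_failed_load() are
hypothetical names made up here, and only three D-form loads (lwz, lbz, lhz)
are decoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (illustration only). */
struct fake_regs {
    uint32_t gpr[32];   /* general purpose registers r0..r31 */
    uint32_t nip;       /* next instruction pointer */
};

/* Primary opcodes of a few D-form PowerPC load instructions. */
enum { OP_LWZ = 32, OP_LBZ = 34, OP_LHZ = 40 };

/* The primary opcode lives in the top 6 bits of the instruction word. */
static unsigned int insn_opcode(uint32_t inst) { return inst >> 26; }

/* The target register (RT) is the 5-bit field right below the opcode. */
static unsigned int insn_rt(uint32_t inst) { return (inst >> 21) & 0x1f; }

/*
 * If inst is one of the loads above, put an all-ones value of the load's
 * width into the target register and step over the instruction.
 * Returns true when the fixup was applied, false when inst was left alone.
 */
static bool fake_failed_load(struct fake_regs *regs, uint32_t inst)
{
    uint32_t all_ones;

    switch (insn_opcode(inst)) {
    case OP_LWZ: all_ones = 0xffffffffu; break;  /* 32-bit load */
    case OP_LHZ: all_ones = 0xffffu;     break;  /* 16-bit load, zero-extended */
    case OP_LBZ: all_ones = 0xffu;       break;  /* 8-bit load, zero-extended */
    default:
        return false;                            /* not a load we emulate */
    }

    regs->gpr[insn_rt(inst)] = all_ones;
    regs->nip += 4;                              /* every instruction is 4 bytes */
    return true;
}
```

Feeding it 0x80640000 (lwz r3,0(r4)) sets r3 to 0xffffffff and moves nip
forward by 4, while a store such as 0x90640000 (stw r3,0(r4)) is left
untouched. The load itself is never retried, so the caller only ever sees
the faked value.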

 Jocke

