linuxppc-dev.lists.ozlabs.org archive mirror
From: Haren Myneni <haren@linux.vnet.ibm.com>
To: Stewart Smith <stewart@linux.ibm.com>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: crash after NX error
Date: Tue, 04 Jun 2019 22:21:42 -0700
Message-ID: <5CF75166.8080301@linux.vnet.ibm.com>
In-Reply-To: <87pnnuav9d.fsf@linux.vnet.ibm.com>

On 06/03/2019 08:23 PM, Stewart Smith wrote:
> On my two-socket POWER9 system (powernv) with 842 zswap set up, I
> recently got a crash with the Ubuntu kernel (I haven't tried with
> upstream, and this is the first time the system has died like this, so
> I'm not sure how repeatable it is).
> 
> [    2.891463] zswap: loaded using pool 842-nx/zbud
> ...
> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
> [16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
> [16868.933726] Faulting instruction address: 0xc000000000391600
> 
> 
> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
>     pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
>     lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
>     sp: c000001c9d98bc20
>    msr: 900000000280b033
>    dar: 6655f67da816cdb8
>   current = 0xc000001ad43cb400
>   paca    = 0xc00000000fac7800   softe: 0        irq_happened: 0x01
>     pid   = 8319, comm = make
> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 4.15.0-50.54-generic 4.15.18)
> 
> 68:mon> t
> [c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
> [c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
> [c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
> [c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
> [c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
> --- Exception: c00 (System Call) at 00007afe9daf87f4
> SP (7fffca606880) is in userspace
> 
> So, it looks like there could be a problem in the error path, plausibly
> fixed by this patch:
> 
> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> Author: Haren Myneni <haren@linux.vnet.ibm.com>
> Date:   Wed Jun 13 00:32:40 2018 -0700
> 
>     crypto/nx: Initialize 842 high and normal RxFIFO control registers
>     
>     NX increments readOffset by FIFO size in receive FIFO control register
>     when CRB is read. But the index in RxFIFO has to match with the
>     corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>     may be processing incorrect CRBs and can cause CRB timeout.
>     
>     VAS FIFO offset is 0 when the receive window is opened during
>     initialization. When the module is reloaded or in kexec boot, readOffset
>     in FIFO control register may not match with VAS entry. This patch adds
>     nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>     control register for both high and normal FIFOs.
>     
>     Signed-off-by: Haren Myneni <haren@us.ibm.com>
>     [mpe: Fixup uninitialized variable warning]
>     Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> 
> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> v4.19-rc1~24^2~50
> 
> 
> Which was never backported to any stable release, so probably needs to
> be for v4.14 through v4.18. Notably, Ubuntu is on v4.15 and it doesn't
> seem to have picked up the patch. I'm opening an Ubuntu bug for this.
> 
> Haren, is this something you can drive through the stable process
> (assuming my above crash looks like this failure)?
> 

Thanks, Stewart. I missed this for the stable releases and will work on it. The fix was merged into the Ubuntu 18.04.x kernel recently and will be in the next update.

The backport also needs this commit:
 
commit 6e708000ec2c93c2bde6a46aa2d6c3e80d4eaeb9
Author: Haren Myneni <haren@linux.vnet.ibm.com>
Date:   Wed Jun 13 00:28:57 2018 -0700

    powerpc/powernv: Export opal_check_token symbol

    Export opal_check_token symbol for modules to check the availability
    of OPAL calls before using them.

    Signed-off-by: Haren Myneni <haren@us.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

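For reference, the `git describe --contains` output quoted above, `v4.19-rc1~24^2~50`, is the nearest tag that contains the commit followed by the ancestry steps from that tag back to the commit. Read that way, the fix first appears in v4.19-rc1, which is why v4.18 and earlier would need a backport. A minimal Python sketch of extracting the tag follows; `first_containing_tag` is a hypothetical helper written for illustration, not part of git.

```python
import re


def first_containing_tag(describe_output: str) -> str:
    """Return the tag part of a `git describe --contains` result.

    The output has the form <tag>[~N][^M]..., where the suffix is the
    ancestry path from the tag back to the commit. Splitting at the first
    '~' or '^' is safe because git does not allow either character in a
    ref name.
    """
    return re.split(r"[~^]", describe_output.strip(), maxsplit=1)[0]


print(first_containing_tag("v4.19-rc1~24^2~50"))  # v4.19-rc1
```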
 
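The offset mismatch described in the quoted commit message for 656ecc16e8fc can be pictured with a toy ring buffer. This is only a sketch under loud assumptions: `ToyRxFifo` and all of its method names are hypothetical, the slot-by-slot indexing is a simplification, and nothing here reflects the real NX or VAS register layout. It only shows why a reader whose offset survives a reload can miss entries written by a writer that restarted at 0, and why resetting the reader's offset removes the mismatch.

```python
class ToyRxFifo:
    """Toy ring buffer: one side writes entries, the other side reads them."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.write_index = 0  # writer side (VAS in the commit text)
        self.read_offset = 0  # reader side (NX readOffset in the commit text)

    def kernel_reopen(self):
        # Reopening the receive window restarts the writer at offset 0.
        # The reader's offset is deliberately left untouched.
        self.write_index = 0

    def engine_reset(self):
        # Stand-in for the reset the commit attributes to nx_coproc_init.
        self.read_offset = 0

    def submit(self, crb):
        self.slots[self.write_index] = crb
        self.write_index = (self.write_index + 1) % len(self.slots)

    def engine_read(self):
        crb = self.slots[self.read_offset]
        self.read_offset = (self.read_offset + 1) % len(self.slots)
        return crb


def demo():
    fifo = ToyRxFifo(4)
    fifo.submit("crb-A")
    first = fifo.engine_read()   # offsets agree, entry is seen

    fifo.kernel_reopen()         # e.g. module reload or kexec boot
    fifo.submit("crb-B")         # written to slot 0
    missed = fifo.engine_read()  # reader looks at slot 1 instead

    fifo.kernel_reopen()
    fifo.engine_reset()          # both sides restart at 0
    fifo.submit("crb-C")
    fixed = fifo.engine_read()
    return first, missed, fixed


print(demo())  # ('crb-A', None, 'crb-C')
```

In the toy model the missed entry simply reads back as an empty slot; the commit message describes the real-world consequence as NX processing incorrect CRBs and CRB timeouts.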

Thread overview: 5+ messages
2019-06-04  3:23 crash after NX error Stewart Smith
2019-06-05  5:21 ` Haren Myneni [this message]
2019-06-05 11:06 ` Michael Ellerman
2019-06-06  2:29   ` Stewart Smith
2019-06-11  0:44   ` Haren Myneni