From: Stewart Smith <stewart@linux.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
haren@linux.vnet.ibm.com
Subject: Re: crash after NX error
Date: Thu, 06 Jun 2019 12:29:38 +1000 [thread overview]
Message-ID: <87lfyfa1kd.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <87zhmwmgv7.fsf@concordia.ellerman.id.au>
Michael Ellerman <mpe@ellerman.id.au> writes:
> Stewart Smith <stewart@linux.ibm.com> writes:
>> On my two socket POWER9 system (powernv) with 842 zwap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [ 2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
>> [16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc000000000391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
>> pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
>> lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
>> sp: c000001c9d98bc20
>> msr: 900000000280b033
>> dar: 6655f67da816cdb8
>> current = 0xc000001ad43cb400
>> paca = 0xc00000000fac7800 softe: 0 irq_happened: 0x01
>> pid = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
>> [c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
>> [c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 00007afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni <haren@linux.vnet.ibm.com>
>> Date: Wed Jun 13 00:32:40 2018 -0700
>>
>> crypto/nx: Initialize 842 high and normal RxFIFO control registers
>>
>> NX increments readOffset by FIFO size in receive FIFO control register
>> when CRB is read. But the index in RxFIFO has to match with the
>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>> may be processing incorrect CRBs and can cause CRB timeout.
>>
>> VAS FIFO offset is 0 when the receive window is opened during
>> initialization. When the module is reloaded or in kexec boot, readOffset
>> in FIFO control register may not match with VAS entry. This patch adds
>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>> control register for both high and normal FIFOs.
>>
>> Signed-off-by: Haren Myneni <haren@us.ibm.com>
>> [mpe: Fixup uninitialized variable warning]
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
>
> Yeah the P9 NX support went in in:
> b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
>
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
>
>
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed.
>
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corrupting in zswap. Haren have you done any testing of error
> injection?
So, things died pretty heavily overnight (requiring e2fsck) with a *lot*
of those wait_for_csb() errors in the log.
It certainly *looks* like there's corruption around, as one of the CI
jobs that failed around that time got "internal compiler error" which is
usually a good sign that things have gone poorly somewhere.
--
Stewart Smith
OPAL Architect, IBM.
next prev parent reply other threads:[~2019-06-06 2:31 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-04 3:23 crash after NX error Stewart Smith
2019-06-05 5:21 ` Haren Myneni
2019-06-05 11:06 ` Michael Ellerman
2019-06-06 2:29 ` Stewart Smith [this message]
2019-06-11 0:44 ` Haren Myneni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87lfyfa1kd.fsf@linux.vnet.ibm.com \
--to=stewart@linux.ibm.com \
--cc=haren@linux.vnet.ibm.com \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mpe@ellerman.id.au \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).