From: Haren Myneni
Date: Mon, 10 Jun 2019 17:44:46 -0700
To: Michael Ellerman
Cc: linuxppc-dev, Stewart Smith
Subject: Re: crash after NX error
Message-Id: <5CFEF97E.1020109@linux.vnet.ibm.com>
In-Reply-To: <87zhmwmgv7.fsf@concordia.ellerman.id.au>
References: <87pnnuav9d.fsf@linux.vnet.ibm.com> <87zhmwmgv7.fsf@concordia.ellerman.id.au>
List-Id: Linux on PowerPC Developers Mail List

On 06/05/2019 04:06 AM, Michael Ellerman wrote:
> Stewart Smith writes:
>> On my two socket POWER9 system (powernv) with 842 zswap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [ 2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
>> [16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc000000000391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
>>     pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
>>     lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
>>     sp: c000001c9d98bc20
>>    msr: 900000000280b033
>>    dar: 6655f67da816cdb8
>>   current = 0xc000001ad43cb400
>>   paca = 0xc00000000fac7800 softe: 0 irq_happened: 0x01
>>     pid = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
>> [c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
>> [c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 00007afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni
>> Date: Wed Jun 13 00:32:40 2018 -0700
>>
>>     crypto/nx: Initialize 842 high and normal RxFIFO control registers
>>
>>     NX increments readOffset by FIFO size in receive FIFO control register
>>     when CRB is read. But the index in RxFIFO has to match with the
>>     corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>>     may be processing incorrect CRBs and can cause CRB timeout.
>>
>>     VAS FIFO offset is 0 when the receive window is opened during
>>     initialization.
>>     When the module is reloaded or in kexec boot, readOffset
>>     in FIFO control register may not match with VAS entry. This patch adds
>>     nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>>     control register for both high and normal FIFOs.
>>
>>     Signed-off-by: Haren Myneni
>>     [mpe: Fixup uninitialized variable warning]
>>     Signed-off-by: Michael Ellerman
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
>
> Yeah the P9 NX support went in in:
>   b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
>
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
>
>
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed?
>
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corruption in zswap. Haren have you done any testing of error injection?

The code path returns the error code from wait_for_csb() properly to the
upper level APIs. In the decompression case, the request falls back to
SW 842 upon failure. If NX is involved in this crash, the compression
request may have been reported as successful despite an invalid CRB
(mismatched FIFO entries in NX and VAS). SW 842 may then have decompressed
invalid data, which might cause corruption later when that data is accessed.

I will try to reproduce the issue with a 4.14 kernel.

Thanks
Haren

>
> cheers
>
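
For illustration only, here is a minimal user-space sketch of the kind of
error injection discussed above: a poll loop that can be forced to time
out, and a caller that falls back to a software path when the hardware
path fails. This is not the kernel's nx-842 code. Every name in it
(fake_csb, fake_wait_for_csb, inject_csb_timeout, and so on) is
hypothetical, and the "decompression" is just a copy standing in for the
real work.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a coprocessor status block (not the kernel struct). */
struct fake_csb {
	bool valid;	/* set by the fake "hardware" when a request completes */
};

/* Fault-injection knob: when true, the fake hardware never completes. */
static bool inject_csb_timeout;

/* Number of requests served by the software path, for test visibility. */
static unsigned int sw_fallbacks;

/* Poll for completion; -ETIMEDOUT models the "CSB still not valid" case. */
static int fake_wait_for_csb(struct fake_csb *csb, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!inject_csb_timeout)
			csb->valid = true;	/* fake hardware finishes at once */
		if (csb->valid)
			return 0;
	}
	return -ETIMEDOUT;
}

/* Fake hardware path: must not write to out unless the wait succeeded. */
static int fake_hw_decompress(const unsigned char *in, size_t len,
			      unsigned char *out)
{
	struct fake_csb csb = { .valid = false };
	int ret = fake_wait_for_csb(&csb, 1000);

	if (ret)
		return ret;		/* propagate the error, leave out untouched */
	memcpy(out, in, len);		/* placeholder for real decompression */
	return 0;
}

/* Fake software path: always succeeds in this sketch. */
static int fake_sw_decompress(const unsigned char *in, size_t len,
			      unsigned char *out)
{
	memcpy(out, in, len);		/* placeholder for real decompression */
	sw_fallbacks++;
	return 0;
}

/* Try the hardware path first; on any error, retry in software. */
static int decompress_with_fallback(const unsigned char *in, size_t len,
				    unsigned char *out)
{
	assert(in && out);
	if (fake_hw_decompress(in, len, out) == 0)
		return 0;
	return fake_sw_decompress(in, len, out);
}
```

Setting inject_csb_timeout to true before calling
decompress_with_fallback() exercises the failure path: the hardware call
returns -ETIMEDOUT without touching the output buffer, and the software
path produces the result. A sketch like this only checks that a reported
failure is handled cleanly. It does not model the case described above,
where the hardware reports success for the wrong CRB.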