Date: Mon, 10 Oct 2022 17:20:57 +0100
From: Jonathan Cameron
To: Jonathan Cameron via
CC: Dan Williams, Bobo WL
Subject: Re: [BUG] cxl can not create region
Message-ID: <20221010172057.00001559@huawei.com>
In-Reply-To: <20220819094655.000005ed@huawei.com>
References: <62f132aee401b_1b3c294c@dwillia2-xfh.jf.intel.com.notmuch>
 <62f27a8e9acb6_1f18b294e2@dwillia2-xfh.jf.intel.com.notmuch>
 <62f5a2ffe8be0_3ce68294a4@dwillia2-xfh.jf.intel.com.notmuch>
 <20220817171619.000021ca@huawei.com>
 <20220818173740.00007f02@huawei.com>
 <20220819094655.000005ed@huawei.com>
X-Mailing-List: linux-cxl@vger.kernel.org

On Fri, 19 Aug 2022 09:46:55 +0100
Jonathan Cameron wrote:

> On Thu, 18 Aug 2022 17:37:40 +0100
> Jonathan Cameron via wrote:
>
> > On Wed, 17 Aug 2022 17:16:19 +0100
> > Jonathan Cameron wrote:
> >
> > > On Thu, 11 Aug 2022 17:46:55 -0700
> > > Dan Williams wrote:
> > >
> > > > Dan Williams wrote:
> > > > > Bobo WL wrote:
> > > > > > Hi Dan,
> > > > > >
> > > > > > Thanks for your reply!
> > > > > >
> > > > > > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams wrote:
> > > > > > >
> > > > > > > What is the output of:
> > > > > > >
> > > > > > > cxl list -MDTu -d decoder0.0
> > > > > > >
> > > > > > > ...? It might be the case that mem1 cannot be mapped by decoder0.0, or
> > > > > > > at least not in the specified order, or that validation check is broken.
> > > > > >
> > > > > > Command "cxl list -MDTu -d decoder0.0" output:
> > > > >
> > > > > Thanks for this, I think I know the problem, but will try some
> > > > > experiments with cxl_test first.
> > > >
> > > > Hmm, so my cxl_test experiment unfortunately passed so I'm not
> > > > reproducing the failure mode. This is the result of creating x4 region
> > > > with devices directly attached to a single host-bridge:
> > > >
> > > > # cxl create-region -d decoder3.5 -w 4 -m -g 256 mem{12,10,9,11} -s $((1<<30))
> > > > {
> > > >   "region":"region8",
> > > >   "resource":"0xf1f0000000",
> > > >   "size":"1024.00 MiB (1073.74 MB)",
> > > >   "interleave_ways":4,
> > > >   "interleave_granularity":256,
> > > >   "decode_state":"commit",
> > > >   "mappings":[
> > > >     {
> > > >       "position":3,
> > > >       "memdev":"mem11",
> > > >       "decoder":"decoder21.0"
> > > >     },
> > > >     {
> > > >       "position":2,
> > > >       "memdev":"mem9",
> > > >       "decoder":"decoder19.0"
> > > >     },
> > > >     {
> > > >       "position":1,
> > > >       "memdev":"mem10",
> > > >       "decoder":"decoder20.0"
> > > >     },
> > > >     {
> > > >       "position":0,
> > > >       "memdev":"mem12",
> > > >       "decoder":"decoder22.0"
> > > >     }
> > > >   ]
> > > > }
> > > > cxl region: cmd_create_region: created 1 region
> > > >
> > > > > Did the commit_store() crash stop reproducing with latest cxl/preview
> > > > > branch?
> > > >
> > > > I missed the answer to this question.
> > > >
> > > > All of these changes are now in Linus' tree perhaps give that a try and
> > > > post the debug log again?
> > >
> > > Hi Dan,
> > >
> > > I've moved onto looking at this one.
> > > 1 HB, 2RP (to make it configure the HDM decoder in the QEMU HB, I'll tidy that up
> > > at some stage), 1 switch, 4 downstream switch ports each with a type 3
> > >
> > > I'm not getting a crash, but can't successfully set up a region.
> > > Upon adding the final target
> > > It's failing in check_last_peer() as pos < distance.
> > > Seems distance is 4 which makes me think it's using the wrong level of the hierarchy for
> > > some reason or that distance check is wrong.
> > > Wasn't a good idea to just skip that step though as it goes boom - though
> > > stack trace is not useful.
> >
> > Turns out really weird corruption happens if you accidentally back two type3 devices
> > with the same memory device. Who would have thought it :)
> >
> > That aside ignoring the check_last_peer() failure seems to make everything work for this
> > topology. I'm not seeing the crash, so my guess is we fixed it somewhere along the way.
> >
> > Now for the fun one. I've replicated the crash if we have
> >
> > 1HB 1*RP 1SW, 4SW-DSP, 4Type3
> >
> > Now, I'd expect to see it not 'work' because the QEMU HDM decoder won't be programmed
> > but the null pointer dereference isn't related to that.
> >
> > The bug is straightforward. Not all decoders have commit callbacks... Will send out
> > a possible fix shortly.
> >
> For completeness I'm carrying this hack because I haven't gotten my head
> around the right fix for check_last_peer() failing on this test topology.
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index c49d9a5f1091..275e143bd748 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -978,7 +978,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>  				rc = check_last_peer(cxled, ep, cxl_rr,
>  						     distance);
>  				if (rc)
> -					return rc;
> +				//	return rc;
>  				goto out_target_set;
>  			}
>  		goto add_target;

I'm still carrying this hack and still haven't worked out the right fix.

Suggestions welcome! If not I'll hopefully get some time on this
towards the end of the week.

Jonathan
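
For anyone following the "not all decoders have commit callbacks" remark quoted above, here is a minimal userspace sketch of that failure class and one way to guard against it. Everything in it is an assumption made for illustration: `struct demo_decoder`, `demo_hw_commit()`, `demo_commit_one()` and `demo_commit_all()` are hypothetical names, not the kernel's `struct cxl_decoder` or any real CXL driver function, and treating a missing callback as "nothing to commit" is only one possible policy, not necessarily the fix that was sent to the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a decoder object. This is NOT the kernel's
 * struct cxl_decoder; it only models "an object with an optional
 * commit callback".
 */
struct demo_decoder {
	const char *name;
	int (*commit)(struct demo_decoder *d);	/* may be NULL */
	int committed;
};

/* Example callback for a decoder that really has something to program. */
static int demo_hw_commit(struct demo_decoder *d)
{
	d->committed = 1;
	return 0;
}

/*
 * Commit one decoder. Calling d->commit(d) unconditionally would
 * dereference a NULL function pointer for decoders that never set one;
 * here a missing callback is assumed to mean "nothing to do".
 */
static int demo_commit_one(struct demo_decoder *d)
{
	assert(d != NULL);

	if (!d->commit)
		return 0;

	return d->commit(d);
}

/* Commit each decoder in order, stopping at the first error. */
static int demo_commit_all(struct demo_decoder *decoders, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int rc = demo_commit_one(&decoders[i]);

		if (rc) {
			fprintf(stderr, "%s: commit failed (%d)\n",
				decoders[i].name, rc);
			return rc;
		}
	}

	return 0;
}
```

With an array holding one decoder whose `commit` is NULL and one using `demo_hw_commit`, `demo_commit_all()` returns 0, leaves the first decoder untouched and marks the second as committed. Whether silently skipping is correct depends on what a callback-less decoder means in the real topology, so the sketch only illustrates the guard itself.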