From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 21E3DC00140 for ; Mon, 15 Aug 2022 17:16:26 +0000 (UTC) Received: from localhost ([::1]:39260 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1oNdhR-0000Nx-1L for qemu-devel@archiver.kernel.org; Mon, 15 Aug 2022 13:16:25 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:43696) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oNdfT-0007qU-1l; Mon, 15 Aug 2022 13:14:23 -0400 Received: from frasgout.his.huawei.com ([185.176.79.56]:2676) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oNdfQ-0005uC-Fw; Mon, 15 Aug 2022 13:14:22 -0400 Received: from fraeml711-chm.china.huawei.com (unknown [172.18.147.207]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4M615Z25Tvz67p52; Tue, 16 Aug 2022 01:11:14 +0800 (CST) Received: from lhrpeml500005.china.huawei.com (7.191.163.240) by fraeml711-chm.china.huawei.com (10.206.15.60) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.24; Mon, 15 Aug 2022 19:14:13 +0200 Received: from localhost (10.122.247.231) by lhrpeml500005.china.huawei.com (7.191.163.240) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.24; Mon, 15 Aug 2022 18:14:13 +0100 Date: Mon, 15 Aug 2022 18:14:12 +0100 To: Jonathan Cameron via CC: Bobo WL , , Subject: Re: [BUG] cxl can not create region Message-ID: <20220815181412.000057c9@huawei.com> In-Reply-To: <20220815180444.00006bdf@huawei.com> References: <20220808133727.00001171@huawei.com> <20220809170825.00001b61@huawei.com> <20220811180857.00005e67@huawei.com> <20220812164403.00001654@huawei.com> <20220815180444.00006bdf@huawei.com> Organization: Huawei Technologies R&D (UK) Ltd. X-Mailer: Claws Mail 4.0.0 (GTK+ 3.24.29; x86_64-w64-mingw32) MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.122.247.231] X-ClientProxiedBy: lhrpeml500002.china.huawei.com (7.191.160.78) To lhrpeml500005.china.huawei.com (7.191.163.240) X-CFilter-Loop: Reflected Received-SPF: pass client-ip=185.176.79.56; envelope-from=jonathan.cameron@huawei.com; helo=frasgout.his.huawei.com X-Spam_score_int: -41 X-Spam_score: -4.2 X-Spam_bar: ---- X-Spam_report: (-4.2 / 5.0 requ) BAYES_00=-1.9, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Reply-to: Jonathan Cameron From: Jonathan Cameron via On Mon, 15 Aug 2022 18:04:44 +0100 Jonathan Cameron wrote: > On Fri, 12 Aug 2022 16:44:03 +0100 > Jonathan Cameron wrote: > > > On Thu, 11 Aug 2022 18:08:57 +0100 > > Jonathan Cameron via wrote: > > > > > On Tue, 9 Aug 2022 17:08:25 +0100 > > > Jonathan Cameron wrote: > > > > > > > On Tue, 9 Aug 2022 21:07:06 +0800 > > > > Bobo WL wrote: > > > > > > > > > Hi Jonathan > > > > > > > > > > Thanks for your reply! > > > > > > > > > > On Mon, Aug 8, 2022 at 8:37 PM Jonathan Cameron > > > > > wrote: > > > > > > > > > > > > Probably not related to your problem, but there is a disconnect in QEMU / > > > > > > kernel assumptionsaround the presence of an HDM decoder when a HB only > > > > > > has a single root port. Spec allows it to be provided or not as an implementation choice. > > > > > > Kernel assumes it isn't provide. Qemu assumes it is. > > > > > > > > > > > > The temporary solution is to throw in a second root port on the HB and not > > > > > > connect anything to it. Longer term I may special case this so that the particular > > > > > > decoder defaults to pass through settings in QEMU if there is only one root port. > > > > > > > > > > > > > > > > You are right! After adding an extra HB in qemu, I can create a x1 > > > > > region successfully. > > > > > But have some errors in Nvdimm: > > > > > > > > > > [ 74.925838] Unknown online node for memory at 0x10000000000, assuming node 0 > > > > > [ 74.925846] Unknown target node for memory at 0x10000000000, assuming node 0 > > > > > [ 74.927470] nd_region region0: nmem0: is disabled, failing probe > > > > > > > > Ah. I've seen this one, but not chased it down yet. Was on my todo list to chase > > > > down. Once I reach this state I can verify the HDM Decode is correct which is what > > > > I've been using to test (Which wasn't true until earlier this week). > > > > I'm currently testing via devmem, more for historical reasons than because it makes > > > > that much sense anymore. > > > > > > *embarassed cough*. We haven't fully hooked the LSA up in qemu yet. > > > I'd forgotten that was still on the todo list. I don't think it will > > > be particularly hard to do and will take a look in next few days. > > > > > > Very very indirectly this error is causing a driver probe fail that means that > > > we hit a code path that has a rather odd looking check on NDD_LABELING. > > > Should not have gotten near that path though - hence the problem is actually > > > when we call cxl_pmem_get_config_data() and it returns an error because > > > we haven't fully connected up the command in QEMU. > > > > So a least one bug in QEMU. We were not supporting variable length payloads on mailbox > > inputs (but were on outputs). That hasn't mattered until we get to LSA writes. > > We just need to relax condition on the supplied length. > > > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > > index c352a935c4..fdda9529fe 100644 > > --- a/hw/cxl/cxl-mailbox-utils.c > > +++ b/hw/cxl/cxl-mailbox-utils.c > > @@ -510,7 +510,7 @@ void cxl_process_mailbox(CXLDeviceState *cxl_dstate) > > cxl_cmd = &cxl_cmd_set[set][cmd]; > > h = cxl_cmd->handler; > > if (h) { > > - if (len == cxl_cmd->in) { > > + if (len == cxl_cmd->in || !cxl_cmd->in) { > Fix is wrong as we use ~0 as the placeholder for variable payload, not 0. Cause of the error is a failure in GET_LSA. Reason, payload length is wrong in QEMU but was hidden previously by my wrong fix here. Probably still a good idea to inject an error in GET_LSA and chase down the refcount issue. diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c index fdda9529fe..e8565fbd6e 100644 --- a/hw/cxl/cxl-mailbox-utils.c +++ b/hw/cxl/cxl-mailbox-utils.c @@ -489,7 +489,7 @@ static struct cxl_cmd cxl_cmd_set[256][256] = { cmd_identify_memory_device, 0, 0 }, [CCLS][GET_PARTITION_INFO] = { "CCLS_GET_PARTITION_INFO", cmd_ccls_get_partition_info, 0, 0 }, - [CCLS][GET_LSA] = { "CCLS_GET_LSA", cmd_ccls_get_lsa, 0, 0 }, + [CCLS][GET_LSA] = { "CCLS_GET_LSA", cmd_ccls_get_lsa, 8, 0 }, [CCLS][SET_LSA] = { "CCLS_SET_LSA", cmd_ccls_set_lsa, ~0, IMMEDIATE_CONFIG_CHANGE | IMMEDIATE_DATA_CHANGE }, [MEDIA_AND_POISON][GET_POISON_LIST] = { "MEDIA_AND_POISON_GET_POISON_LIST", @@ -510,12 +510,13 @@ void cxl_process_mailbox(CXLDeviceState *cxl_dstate) cxl_cmd = &cxl_cmd_set[set][cmd]; h = cxl_cmd->handler; if (h) { - if (len == cxl_cmd->in || !cxl_cmd->in) { + if (len == cxl_cmd->in || cxl_cmd->in == ~0) { cxl_cmd->payload = cxl_dstate->mbox_reg_state + A_CXL_DEV_CMD_PAYLOAD; And woot, we get a namespace in the LSA :) I'll post QEMU fixes in next day or two. Kernel side now seems more or less fine be it with suspicious refcount underflow. > > With that fixed we hit new fun paths - after some errors we get the > worrying - not totally sure but looks like a failure on an error cleanup. > I'll chase down the error source, but even then this is probably triggerable by > hardware problem or similar. Some bonus prints in here from me chasing > error paths, but it's otherwise just cxl/next + the fix I posted earlier today. > > [ 69.919877] nd_bus ndbus0: START: nd_region.probe(region0) > [ 69.920108] nd_region_probe > [ 69.920623] ------------[ cut here ]------------ > [ 69.920675] refcount_t: addition on 0; use-after-free. > [ 69.921314] WARNING: CPU: 3 PID: 710 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x144 > [ 69.926949] Modules linked in: cxl_pmem cxl_mem cxl_pci cxl_port cxl_acpi cxl_core > [ 69.928830] CPU: 3 PID: 710 Comm: kworker/u8:9 Not tainted 5.19.0-rc3+ #399 > [ 69.930596] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 > [ 69.931482] Workqueue: events_unbound async_run_entry_fn > [ 69.932403] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 69.934023] pc : refcount_warn_saturate+0xa0/0x144 > [ 69.935161] lr : refcount_warn_saturate+0xa0/0x144 > [ 69.936541] sp : ffff80000890b960 > [ 69.937921] x29: ffff80000890b960 x28: 0000000000000000 x27: 0000000000000000 > [ 69.940917] x26: ffffa54a90d5cb10 x25: ffffa54a90809e98 x24: 0000000000000000 > [ 69.942537] x23: ffffa54a91a3d8d8 x22: ffff0000c5254800 x21: ffff0000c5254800 > [ 69.944013] x20: ffff0000ce924180 x19: ffff0000c5254800 x18: ffffffffffffffff > [ 69.946100] x17: ffff5ab66e5ef000 x16: ffff80000801c000 x15: 0000000000000000 > [ 69.947585] x14: 0000000000000001 x13: 0a2e656572662d72 x12: 657466612d657375 > [ 69.948670] x11: 203b30206e6f206e x10: 6f69746964646120 x9 : ffffa54a8f63d288 > [ 69.950679] x8 : 206e6f206e6f6974 x7 : 69646461203a745f x6 : 00000000fffff31e > [ 69.952113] x5 : ffff0000ff61ba08 x4 : 00000000fffff31e x3 : ffff5ab66e5ef000 > root@debian:/sys/bus/cxl/devices/decoder0.0/region0# [ 69.954752] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c512e740 > [ 69.957098] Call trace: > [ 69.957959] refcount_warn_saturate+0xa0/0x144 > [ 69.958773] get_ndd+0x5c/0x80 > [ 69.959294] nd_region_register_namespaces+0xe4/0xe90 > [ 69.960253] nd_region_probe+0x100/0x290 > [ 69.960796] nvdimm_bus_probe+0xf4/0x1c0 > [ 69.962087] really_probe+0x19c/0x3f0 > [ 69.962620] __driver_probe_device+0x11c/0x190 > [ 69.963258] driver_probe_device+0x44/0xf4 > [ 69.963773] __device_attach_driver+0xa4/0x140 > [ 69.964471] bus_for_each_drv+0x84/0xe0 > [ 69.965068] __device_attach+0xb0/0x1f0 > [ 69.966101] device_initial_probe+0x20/0x30 > [ 69.967142] bus_probe_device+0xa4/0xb0 > [ 69.968104] device_add+0x3e8/0x910 > [ 69.969111] nd_async_device_register+0x24/0x74 > [ 69.969928] async_run_entry_fn+0x40/0x150 > [ 69.970725] process_one_work+0x1dc/0x450 > [ 69.971796] worker_thread+0x154/0x450 > [ 69.972700] kthread+0x118/0x120 > [ 69.974141] ret_from_fork+0x10/0x20 > [ 69.975141] ---[ end trace 0000000000000000 ]--- > [ 70.117887] Into nd_namespace_pmem_set_resource() > > > cxl_cmd->payload = cxl_dstate->mbox_reg_state + > > A_CXL_DEV_CMD_PAYLOAD; > > ret = (*h)(cxl_cmd, cxl_dstate, &len); > > > > > > This lets the nvdimm/region probe fine, but I'm getting some issues with > > namespace capacity so I'll look at what is causing that next. > > Unfortunately I'm not that familiar with the driver/nvdimm side of things > > so it's take a while to figure out what kicks off what! > > > > Jonathan > > > > > > > > Jonathan > > > > > > > > > > > > > > > > > > > > And x4 region still failed with same errors, using latest cxl/preview > > > > > branch don't work. > > > > > I have picked "Two CXL emulation fixes" patches in qemu, still not working. > > > > > > > > > > Bob > > > > > > > > >