Date: Thu, 28 Sep 2023 17:04:15 +0100
To: Vikram Sethi
CC: Alex Williamson, Jason Gunthorpe, Ankit Agrawal, David Hildenbrand, Cédric Le Goater, shannon.zhaosl@gmail.com, peter.maydell@linaro.org, ani@anisinha.ca, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Andy Currid, qemu-arm@nongnu.org, qemu-devel@nongnu.org, Gavin Shan, ira.weiny@intel.com, navneet.singh@intel.com, Dave Jiang
Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Message-ID: <20230928170415.00005190@Huawei.com>
From: Jonathan Cameron
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi wrote:

> > From: Alex Williamson
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe
> > Cc: Jonathan Cameron; Ankit Agrawal; David Hildenbrand; Cédric Le Goater;
> > shannon.zhaosl@gmail.com; peter.maydell@linaro.org; ani@anisinha.ca;
> > Aniket Agashe; Neo Jia; Kirti Wankhede; Tarun Gupta (SW-GPU);
> > Vikram Sethi; Andy Currid; qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > Gavin Shan; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> >
> >
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe wrote:
> >
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easily
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hacking - Linux can't dynamically create NUMA
> > > nodes.
> >
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations? Thanks,
> >
>
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices,
> there is additional information FW knows that the kernel doesn't. For example,
> what the distance/latency between CPU and the device NUMA node is. While CXL devices
> present a CDAT table which gives latency type attributes within the device, there would still be some
> guesswork needed in the kernel as to what the end to end latency/distance is.

FWIW, there shouldn't be any guesswork needed (for the lightly loaded case
anyway, which is what would be in HMAT). That's what the Generic Ports were
added to SRAT for. Dave Jiang has a patch set
https://lore.kernel.org/all/168695160531.3031571.4875512229068707023.stgit@djiang5-mobl3/
to do the maths... For CXL there is no problem fully describing the access
characteristics.
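
As a rough illustration of the kind of "maths" referred to above (an editorial sketch, not code taken from the linked patch set): if HMAT describes the path from a CPU node to a Generic Port and CDAT describes the path inside the device, an end-to-end figure can be estimated by adding the latencies of the segments and taking the smallest bandwidth along the path. The `AccessCoordinate` type, the function names, the units and every number below are assumptions made up for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class AccessCoordinate:
    """Performance of one path segment (hypothetical type, illustration only)."""
    read_latency_ps: int       # assumed unit: picoseconds
    write_latency_ps: int
    read_bandwidth_mbps: int   # assumed unit: MB/s
    write_bandwidth_mbps: int


def combine(a: AccessCoordinate, b: AccessCoordinate) -> AccessCoordinate:
    """Join two segments in series: latencies add, bandwidth is the bottleneck."""
    return AccessCoordinate(
        read_latency_ps=a.read_latency_ps + b.read_latency_ps,
        write_latency_ps=a.write_latency_ps + b.write_latency_ps,
        read_bandwidth_mbps=min(a.read_bandwidth_mbps, b.read_bandwidth_mbps),
        write_bandwidth_mbps=min(a.write_bandwidth_mbps, b.write_bandwidth_mbps),
    )


def end_to_end(segments: list[AccessCoordinate]) -> AccessCoordinate:
    """Fold every segment of the path into a single end-to-end estimate."""
    if not segments:
        raise ValueError("at least one path segment is required")
    return reduce(combine, segments)


# Made-up numbers: CPU node -> Generic Port (as HMAT might report it),
# then Generic Port -> device memory (as CDAT might report it).
host_to_port = AccessCoordinate(100_000, 100_000, 64_000, 64_000)
device_internal = AccessCoordinate(50_000, 60_000, 32_000, 30_000)

path = end_to_end([host_to_port, device_internal])
```

With both halves of the path described, the combined figure can be compared directly against the inter-socket figures, which is the comparison the quoted text is worried about. Intermediate hops such as switches or links would simply be further entries in the `segments` list.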
> It's probably not the best outcome to just consider this generically "far memory" because
> whether it is further than inter-socket memory access or not matters.
> Pre-CXL devices such as the one for this patchset don't even have CDAT, so the kernel by itself has
> no idea whether this latency/distance is less than or more than inter-socket memory access latency
> even.

Just because I'm feeling cheeky - you could emulate a DOE and CDAT :)?
Though I suppose you don't want to teach the guest driver about it.

> So especially for devices present at boot, FW knows this information and should provide it.
> Similarly, QEMU should pass along this information to VMs for the best outcomes.

I have no problem with the argument that FW has the info and should provide
it; my concern is only with the 'how' part.

Jonathan

>
> Thanks
> > Alex
>