From mboxrd@z Thu Jan 1 00:00:00 1970 Received: by 2002:a17:907:8744:b0:9bd:85f7:2662 with SMTP id qo4csp661440ejc; Thu, 12 Oct 2023 02:00:16 -0700 (PDT) X-Google-Smtp-Source: AGHT+IF20ThXvzLPt+lRxakq+2ZwtV8CW6j3knumM1+ZbiLsDnqINb8jX+eoqp0SdWHgh3WPMlGR X-Received: by 2002:ac8:5b8e:0:b0:418:1edd:d2ed with SMTP id a14-20020ac85b8e000000b004181eddd2edmr27963032qta.4.1697101216677; Thu, 12 Oct 2023 02:00:16 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697101216; cv=none; d=google.com; s=arc-20160816; b=xxMLHkWV27bwBjZf7Jx3ZSyxI/6+ZH9RnYfz1xLmnETgIK8+qwrVWihAS5yRm6rYJg dJzygCA6rA4zR0UKKvoEl8Gg7wbrmnHFuKNYV67fJ1oe9mKm6sEo3Cr2L7skRhZKYeBw A0QsjO46Xi6tA0ZoMaW7csZeujYGUQEzUUyxetuGJPv7bJyla0aaRWqw9xbav5/Km6gh lAy69Ies52hWyBh16+gEGbLOBfL0jjVhuq4Xr3j4J5TPMpiWGKAATsBzi2IYCR17mU+j We5Qj3pYXD28MyQ2Znm7p5oEAehFsl/OxlMcnaNit+D6XMmSmTApzKd/Irqg1wcMVDP/ wBgw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:from:reply-to:list-subscribe:list-help:list-post :list-archive:list-unsubscribe:list-id:precedence :content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:date; bh=ZXdVZVIoSrWyYgN1rQFy1D3pewUnmoJ4jDhI3UQOa7k=; fh=o1ofa2/fMciLt+MQAh4VP73IHVEleZc4Qx82OkKPHmQ=; b=QVVNjmfHC8G5jaOE2dhOX02NvBVxNvh0QYkLUK1hbNT2BKYZc0FBf247U3CCxWKPMs hwDh1mo7PTfRvBjX+zk5/vDJvlqttwmLXkhFw6UBl/e2L77+mVgLLO2tQp2QVa67R2ob go2dVRkQBq3Wsy+a35maak00u4Fp94ZVyA2v8rOy/MmFe/fOqtsika+KaT9/FJIj1TAA Z3m1gvpU0mkEHl8lK5+65to0Dz3NApux/bWwhjqqYjgGzhGHYuN1Aiodgageeys/HaCY Lhg4NbK05p+hOZ2k84GBCJuMumQerJjQ/FhKWpW6P23MrJs1i4PERE3U3/qFofWbq4kC jVLA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom="qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=nongnu.org Return-Path: Received: from lists.gnu.org (lists.gnu.org. 
[209.51.188.17]) by mx.google.com with ESMTPS id p18-20020a05622a00d200b0041996c8044dsi10579201qtw.307.2023.10.12.02.00.16 for (version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256); Thu, 12 Oct 2023 02:00:16 -0700 (PDT) Received-SPF: pass (google.com: domain of qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org designates 209.51.188.17 as permitted sender) client-ip=209.51.188.17; Authentication-Results: mx.google.com; spf=pass (google.com: domain of qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom="qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=nongnu.org Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qqrY8-0007jz-Ct; Thu, 12 Oct 2023 05:00:08 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qqrY6-0007jE-6k; Thu, 12 Oct 2023 05:00:06 -0400 Received: from frasgout.his.huawei.com ([185.176.79.56]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qqrY0-0007YW-QP; Thu, 12 Oct 2023 05:00:04 -0400 Received: from lhrpeml500005.china.huawei.com (unknown [172.18.147.200]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4S5k743qdfz6K6XZ; Thu, 12 Oct 2023 16:57:52 +0800 (CST) Received: from localhost (10.48.155.47) by lhrpeml500005.china.huawei.com (7.191.163.240) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.31; Thu, 12 Oct 2023 09:59:55 +0100 Date: Thu, 12 Oct 2023 09:59:54 +0100 To: Vikram Sethi CC: Ankit Agrawal , Jason Gunthorpe , "alex.williamson@redhat.com" , "clg@redhat.com" , "shannon.zhaosl@gmail.com" , "peter.maydell@linaro.org" , "ani@anisinha.ca" , "berrange@redhat.com" , "eduardo@habkost.net" , "imammedo@redhat.com" , "mst@redhat.com" , 
"eblake@redhat.com" , "armbru@redhat.com" , "david@redhat.com" , "gshan@redhat.com" , Aniket Agashe , Neo Jia , Kirti Wankhede , "Tarun Gupta (SW-GPU)" , Andy Currid , Dheeraj Nigam , Uday Dhoke , "qemu-arm@nongnu.org" , "qemu-devel@nongnu.org" , Dave Jiang , "Shanker Donthineni" Subject: Re: [PATCH v2 1/3] qom: new object to associate device to numa node Message-ID: <20231012095954.00006ebb@Huawei.com> In-Reply-To: References: <20231007201740.30335-1-ankita@nvidia.com> <20231007201740.30335-2-ankita@nvidia.com> <20231009132642.00002c8d@Huawei.com> Organization: Huawei Technologies Research and Development (UK) Ltd. X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32) MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.48.155.47] X-ClientProxiedBy: lhrpeml500001.china.huawei.com (7.191.163.213) To lhrpeml500005.china.huawei.com (7.191.163.240) X-CFilter-Loop: Reflected Received-SPF: pass client-ip=185.176.79.56; envelope-from=jonathan.cameron@huawei.com; helo=frasgout.his.huawei.com X-Spam_score_int: -41 X-Spam_score: -4.2 X-Spam_bar: ---- X-Spam_report: (-4.2 / 5.0 requ) BAYES_00=-1.9, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H5=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-arm@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-to: Jonathan Cameron From: Jonathan Cameron via Errors-To: qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org Sender: qemu-arm-bounces+alex.bennee=linaro.org@nongnu.org X-TUID: FQm+ScEVBf9m On Wed, 11 Oct 2023 17:37:11 +0000 Vikram Sethi wrote: > Hi Jonathan, > > > -----Original Message----- > > From: Jonathan Cameron > > Sent: Monday, October 9, 2023 7:27 AM > > To: Ankit Agrawal > > Cc: Jason Gunthorpe ; alex.williamson@redhat.com; > > clg@redhat.com; shannon.zhaosl@gmail.com; 
peter.maydell@linaro.org; > > ani@anisinha.ca; berrange@redhat.com; eduardo@habkost.net; > > imammedo@redhat.com; mst@redhat.com; eblake@redhat.com; > > armbru@redhat.com; david@redhat.com; gshan@redhat.com; Aniket > > Agashe ; Neo Jia ; Kirti Wankhede > > ; Tarun Gupta (SW-GPU) ; > > Vikram Sethi ; Andy Currid ; > > Dheeraj Nigam ; Uday Dhoke ; > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Dave Jiang > > > > Subject: Re: [PATCH v2 1/3] qom: new object to associate device to numa > > node > > > > > > On Sun, 8 Oct 2023 01:47:38 +0530 > > wrote: > > > > > From: Ankit Agrawal > > > > > > The CPU cache coherent device memory can be added as NUMA nodes > > > distinct from the system memory nodes. These nodes are associated with > > > the device and Qemu needs a way to maintain this link. > > > > Hi Ankit, > > > > I'm not sure I'm convinced of the approach to creating nodes for memory > > usage (or whether that works in Linux on all NUMA ACPI archs), but I am > > keen to see Generic Initiator support in QEMU. I'd also like to see it done in a > > way that naturally extends to Generic Ports which are very similar (but don't > > hang memory off them! :) Dave Jiang posted a PoC a while back for generic > > ports. > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore. 
> > kernel.org%2Fqemu-
> > devel%2F168185633821.899932.322047053764766056.stgit%40djiang5-
> > mobl3%2F&data=05%7C01%7Cvsethi%40nvidia.com%7C846b19f87bc5424d
> > c33608dbc8c3015d%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7
> > C638324512146712954%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjA
> > wMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%
> > 7C%7C&sdata=v318MXognoITHyv7AFqZAfvUi2hLy2ZUVnLvyQ2IAfY%3D&res
> > erved=0
> >
> > My concern with this approach is that it is using a side effect of a Linux
> > implementation detail that the infrastructure to bring up coherent memory
> > is all present even for a GI only node (if it is which I can't recall) I'm also fairly
> > sure we never tidied up the detail of going from the GI to the device in Linux
> > (because it's harder than a _PXM entry for the device). It requires stashing a
> > better description than the BDF before potentially doing reenumeration so
> > that we can rebuild the association after that is done.
> >
>
> I'm not sure I understood the concern. Are you suggesting that the ACPI specification
> somehow prohibits memory associated with a GI node in the same PXM? i.e. whether the GI is memory-less
> or with memory isn't mandated by the spec IIRC. Certainly seems perfectly normal
> for an accelerator with memory to have a GI with memory and that memory be able to be associated with the same PXM.

Indeed, it is reasonable that a GI would have associated memory, but if
it's "normal memory" (i.e. coherent and not device private memory
accessed by PCI bar etc) then the expectation would be that the memory
is in SRAT as a memory entry. Which brings us back to the original
question of whether 0 sized memory nodes are fine.

> So what about this patch is using a Linux implementation detail? Even if Linux wasn't currently supporting
> that use case, it is something that would have been reasonable to add IMO. What am I missing?

Linux is careful to only bring up the infrastructure for specific types
of proximity node. It works its way through SRAT and sets appropriate
bitmap bits to say which combination of PXM node types a given node is
(CPU, Memory, GI etc). After that walk is done it then brings up various
infrastructure. What I can't remember (you'll need to experiment) is
whether there is anything not brought up for a non-Memory node that you
would need. Might be fine, but that doesn't mean it will remain fine.
Maybe we just need to make sure the documentation / comments in Linux
cover this use case. You are on your own for what other OSes decided is
valid here, as the specification does not mention this AFAIK. If it does
then add a reference.

There is a non-trivial (potential) cost to enabling facilities on NUMA
nodes that will never make use of them - a bunch of longer searches etc
when looking for memory. For GIs we enable pretty much everything a CPU
node uses. That was controversial, though only well after support was
already in - the controversy being that it added costs to paths that
didn't care about GIs.

Basically it boils down to this: using unexpected corners of
specifications may prove fragile. For one thing, I'm doubtful that the
NUMA description the kernel exposes (coming from a subset of HMAT) will
deal with this case. Not tried it though so you may be lucky.
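To make the SRAT walk described above a bit more concrete, here is a toy
model of it. This is only a sketch under stated assumptions, not the
kernel's code: NodeKind, SratEntry, classify_nodes() and
needs_memory_infrastructure() are all hypothetical names made up for
illustration, and the "only nodes with the Memory bit get memory set-up"
policy is an assumption standing in for whatever Linux actually does.

```python
from dataclasses import dataclass
from enum import Flag


class NodeKind(Flag):
    """Hypothetical per-node type bits (not the kernel's own names)."""
    CPU = 1
    MEMORY = 2
    GENERIC_INITIATOR = 4


@dataclass(frozen=True)
class SratEntry:
    """Simplified stand-in for one SRAT affinity structure."""
    kind: NodeKind
    pxm: int  # proximity domain the entry belongs to


def classify_nodes(entries):
    """Walk the table once and OR together the kinds seen for each PXM."""
    kinds = {}
    for entry in entries:
        kinds[entry.pxm] = kinds.get(entry.pxm, NodeKind(0)) | entry.kind
    return kinds


def needs_memory_infrastructure(kinds, pxm):
    """Assumed policy: only nodes with the MEMORY bit get memory set-up."""
    return bool(kinds.get(pxm, NodeKind(0)) & NodeKind.MEMORY)


# Example table: PXM 0 has CPUs and memory, PXM 1 is a GI-only node.
srat = [
    SratEntry(NodeKind.CPU, 0),
    SratEntry(NodeKind.MEMORY, 0),
    SratEntry(NodeKind.GENERIC_INITIATOR, 1),
]
node_kinds = classify_nodes(srat)
```

In this model the GI-only node never gets the Memory bit, so anything
gated on that bit would be skipped for it. Whether the real kernel gates
something you need in that way is exactly the part that has to be checked
by experiment.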
Jonathan