From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 12 Jan 2024 08:57:28 -0700
User-Agent: Betterbird (Linux)
Subject: Re: [PATCH v3 3/3] cxl: Add memory hotplug notifier for cxl region
From: Dave Jiang
To: Jonathan Cameron
Cc: Dan Williams, "Huang, Ying", Greg Kroah-Hartman, "Rafael J. Wysocki"
References: <170441200977.3574076.13110207881243626581.stgit@djiang5-mobl3>
 <170441211484.3574076.5894396662836000435.stgit@djiang5-mobl3>
 <87r0is9v6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240108121538.00001369@Huawei.com>
 <38be52cf-a4ea-402c-9b14-47a80427f0c8@intel.com>
 <20240109162753.00005b2b@Huawei.com>
 <659d9e563e1fd_24a8294b7@dwillia2-xfh.jf.intel.com.notmuch>
 <20240110100048.000062c0@Huawei.com>
 <6132c975-3208-40f4-a710-b7b7dac608a1@intel.com>
 <20240112113023.00006c50@Huawei.com>
In-Reply-To: <20240112113023.00006c50@Huawei.com>
X-Mailing-List: linux-cxl@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
On 1/12/24 04:30, Jonathan Cameron wrote:
> On Wed, 10 Jan 2024 08:27:57 -0700
> Dave Jiang wrote:
>
>> On 1/10/24 03:00, Jonathan Cameron wrote:
>>> On Tue, 9 Jan 2024 11:28:22 -0800
>>> Dan Williams wrote:
>>>
>>>> Jonathan Cameron wrote:
>>>>> On Mon, 8 Jan 2024 11:18:33 -0700
>>>>> Dave Jiang wrote:
>>>>>
>>>>>> On 1/8/24 05:15, Jonathan Cameron wrote:
>>>>>>> On Mon, 08 Jan 2024 14:49:03 +0800
>>>>>>> "Huang, Ying" wrote:
>>>>>>>
>>>>>>>> Dave Jiang writes:
>>>>>>>>
>>>>>>>>> When the CXL region is formed, the driver computes the performance
>>>>>>>>> data for the region. However, this data is not available in the node
>>>>>>>>> data collection that was populated by the HMAT during kernel
>>>>>>>>> initialization. Add a memory hotplug notifier to update the
>>>>>>>>> performance data in the node hmem_attrs to expose the newly
>>>>>>>>> calculated region performance data. The CXL region is created under
>>>>>>>>> a specific CFMWS. The node for the CFMWS is created during SRAT
>>>>>>>>> parsing by acpi_parse_cfmws(). Additional regions may overwrite the
>>>>>>>>> initial data, but since this is for the same proximity domain it's
>>>>>>>>> a don't care for now.
>>>>>>>>>
>>>>>>>>> The node_set_perf_attrs() symbol is exported to allow update of
>>>>>>>>> perf attributes for a node. The sysfs path of
>>>>>>>>> /sys/devices/system/node/nodeX/access0/initiators/* is created by
>>>>>>>>> node_set_perf_attrs() for the various attributes, where nodeX is
>>>>>>>>> matched to the proximity domain of the CXL region.
>>>>>>>
>>>>>>> As per discussion below, why is access1 not also relevant for CXL
>>>>>>> memory? (It's probably more relevant than access0 in many cases!)
>>>>>>>
>>>>>>> For historical reference, I wanted access0 to be the CPU-only one,
>>>>>>> but review feedback was that access0 was already defined as
>>>>>>> 'initiator based' so we couldn't just make the 0-indexed one the
>>>>>>> case most people care about. Hence we grew access1 to cover the
>>>>>>> CPU-only case which most software cares about.
>>>>>>>
>>>>>>>>>
>>>>>>>>> Cc: Greg Kroah-Hartman
>>>>>>>>> Cc: Rafael J. Wysocki
>>>>>>>>> Reviewed-by: "Huang, Ying"
>>>>>>>>> Signed-off-by: Dave Jiang
>>>>>>>>> ---
>>>>>>>>> v3:
>>>>>>>>> - Change EXPORT_SYMBOL_NS_GPL(,CXL) to EXPORT_SYMBOL_GPL() (Jonathan)
>>>>>>>>> - Use read_bandwidth as check for valid coords (Jonathan)
>>>>>>>>> - Remove setting of coord access level 1. (Jonathan)
>>>>>>>>> ---
>>>>>>>>>  drivers/base/node.c       |  1 +
>>>>>>>>>  drivers/cxl/core/region.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>  drivers/cxl/cxl.h         |  3 +++
>>>>>>>>>  3 files changed, 46 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>>>>>>> index cb2b6cc7f6e6..48e5cb292765 100644
>>>>>>>>> --- a/drivers/base/node.c
>>>>>>>>> +++ b/drivers/base/node.c
>>>>>>>>> @@ -215,6 +215,7 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>>>>>>>>>  		}
>>>>>>>>>  	}
>>>>>>>>>  }
>>>>>>>>> +EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>>   * struct node_cache_info - Internal tracking for memory node caches
>>>>>>>>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>>>>>>>>> index d28d24524d41..bee65f535d6c 100644
>>>>>>>>> --- a/drivers/cxl/core/region.c
>>>>>>>>> +++ b/drivers/cxl/core/region.c
>>>>>>>>> @@ -4,6 +4,7 @@
>>>>>>>>>  #include
>>>>>>>>>  #include
>>>>>>>>>  #include
>>>>>>>>> +#include
>>>>>>>>>  #include
>>>>>>>>>  #include
>>>>>>>>>  #include
>>>>>>>>> @@ -2972,6 +2973,42 @@ static int is_system_ram(struct resource *res, void *arg)
>>>>>>>>>  	return 1;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
>>>>>>>>> +					  unsigned long action, void *arg)
>>>>>>>>> +{
>>>>>>>>> +	struct cxl_region *cxlr = container_of(nb, struct cxl_region,
>>>>>>>>> +					       memory_notifier);
>>>>>>>>> +	struct cxl_region_params *p = &cxlr->params;
>>>>>>>>> +	struct cxl_endpoint_decoder *cxled = p->targets[0];
>>>>>>>>> +	struct cxl_decoder *cxld = &cxled->cxld;
>>>>>>>>> +	struct memory_notify *mnb = arg;
>>>>>>>>> +	int nid = mnb->status_change_nid;
>>>>>>>>> +	int region_nid;
>>>>>>>>> +
>>>>>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>>>>>> +		return NOTIFY_DONE;
>>>>>>>>> +
>>>>>>>>> +	region_nid = phys_to_target_node(cxld->hpa_range.start);
>>>>>>>>> +	if (nid != region_nid)
>>>>>>>>> +		return NOTIFY_DONE;
>>>>>>>>> +
>>>>>>>>> +	/* Don't set if there's no coordinate information */
>>>>>>>>> +	if (!cxlr->coord.write_bandwidth)
>>>>>>>>> +		return NOTIFY_DONE;
>>>>>>>>
>>>>>>>> Although you said you would use "read_bandwidth" in the changelog,
>>>>>>>> you actually didn't do that.
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +	node_set_perf_attrs(nid, &cxlr->coord, 0);
>>>>>>>>> +	node_set_perf_attrs(nid, &cxlr->coord, 1);
>>>>>>>>
>>>>>>>> And this.
>>>>>>>>
>>>>>>>> But I don't think it's good to remove access level 1. According to
>>>>>>>> commit b9fffe47212c ("node: Add access1 class to represent CPU to
>>>>>>>> memory characteristics"), access level 1 is for performance from CPU
>>>>>>>> to memory. So, we should keep access level 1. For a CXL memory
>>>>>>>> device, access level 0 and access level 1 should be equivalent. Will
>>>>>>>> the code be used for something like a GPU connected via CXL, where
>>>>>>>> access level 0 may be for the performance from the GPU to the memory?
>>>>>>>
>>>>>>> I disagree. They are no more equivalent than they are on any other
>>>>>>> complex system.
>>>>>>>
>>>>>>> e.g. A CXL root port being described using generic port infrastructure
>>>>>>> may be on a different die (IO dies are a common architecture) in the
>>>>>>> package than the CPU cores, and that IO die may well have generic
>>>>>>> initiators that are much nearer than the CPU cores.
>>>>>>>
>>>>>>> In those cases access0 will cover initiators on the IO die but access1
>>>>>>> will cover the nearest CPU cores (initiators).
>>>>>>>
>>>>>>> Both should arguably be there for CXL memory, as both are as relevant
>>>>>>> as they are for any other memory.
>>>>>>>
>>>>>>> If/when we get some GPUs etc. on CXL that are initiators this will all
>>>>>>> get a lot more fun, but for now we can kick that into the long grass.
>>>>>>
>>>>>> With the current way of storing HMAT targets information, only the
>>>>>> best performance data is stored (access0).
>>>>>> The existing HMAT handling code also sets access1 if the associated
>>>>>> initiator node contains a CPU, for conventional memory. The currently
>>>>>> calculated full CXL path is the access0 data. I think what's missing
>>>>>> is the check to see if the associated initiator node is also a CPU
>>>>>> node, setting access1 conditionally based on that. Maybe if that
>>>>>> conditional gets added then that is ok for what we have now?
>>>>>
>>>>> You also need the access1 initiators to be figured out (the nearest
>>>>> one that has a CPU) - so two separate sets of calculations. Could
>>>>> shortcut the maths if they happen to be the same node, of course.
>>>>
>>>> Where is "access1" coming from? The generic port is the only
>>>> performance profile that is being calculated by the CDAT code, and
>>>> there is no other initiator.
>>>
>>> This isn't about initiators on the CXL side of the port (for now
>>> anyway). It's about initiators in the host system.
>>>
>>>> Now if "access1" is a convention of "that's the CPU" then we should
>>>> skip emitting access0 altogether and reserve that for some future
>>>> accelerator case that can define a better access profile talking to
>>>> its own local memory. Otherwise having access0 and access1 when the
>>>> only initiator is the generic port (which includes all CPUs attached
>>>> to that generic port) does not resolve for me.
>>>
>>> The initiators here are:
>>>
>>> * CPUs in the host - due to limitations of the HMAT presentation, that
>>>   actually means those CPUs in the host that are nearest to the generic
>>>   port. Only these are considered for access1. So for simple placement
>>>   decisions on CPU-only workloads this is what matters.
>>> * Other initiators in the host, such as NICs on PCI (typically ones
>>>   that are presented as RCiEPs or behind 'fake' switches but are
>>>   actually in the same die as the root complex). These and CPUs are
>>>   included for access0.
>>> * (not supported yet) Other initiators in the CXL fabric.
>>>
>>> My ancient white paper needs an update to include generic ports as they
>>> do make things more complex.
>>> https://github.com/hisilicon/acpi-numa-whitepaper/releases/download/v0.93/NUMA_Description_Under_ACPI_6.3_v0.93.pdf
>>>
>>> Anyhow: ASCII art time. (Simplified diagram of an old production CPU
>>> with CXL added where the PCI RC is - so no future product info, but
>>> expect to see systems that look similar to this :))
>>>
>>> Note the IO die might also be in the middle, or my "favorite" design -
>>> in a separate package entirely - IO expanders on the inter-socket
>>> interconnect (UPI or similar) ;) Note these might not be physical
>>> systems - an example is a VM workload which occasionally needs to use
>>> an 'extra' GPU. That GPU comes from a host socket on which the VM has
>>> no CPU resources or memory. Anyhow, given the diagrams I've seen of
>>> production systems, pretty much anything you can conceive is being
>>> built by someone.
>>>
>>>  ________________         __________________
>>> |                |       |                  |
>>> | Host DDR (PXM0)|       | Host DDR (PXM1)  |
>>> |________________|       |__________________|
>>>         |                         |
>>>  _______|________         _______|______       _________________
>>> |                |       |              |     |                 |
>>> |    CPU Die     |-------|   CPU Die    |-----|     IO DIE      |
>>> |    PXM 0       |       |   PXM 1      |     |     PXM 2       |
>>> |                |       |              |     |  NIC (GP + GI)  |
>>> |________________|       |______________|     |_________________|
>>>                                                       |
>>>                                                _______|________
>>>                                               |                |
>>>                                               |    CXL Mem     |
>>>                                               |                |
>>>                                               |________________|
>>>
>>> So in this case access0 should have PXM2 as the initiator and include
>>> the bandwidth and latency numbers from PXM2 to itself (where the GP is)
>>> and those to the CXL memory that Dave's code adds in.
>>>
>>> Access1 is from PXM1 to PXM2 (to get to the GP) and on to the CXL mem.
>>>
>>> Note that one reason to do this is that the latency from the NIC in
>>> PXM2 to CXL mem might well be not much worse than from it to the memory
>>> on PXM1 (CPU die), so placement decisions might favor putting NIC
>>> buffers in CXL mem, particularly if the bandwidth is good.
>>>
>>>>>> If/when the non-CPU initiators show up for CXL, we'll need to change
>>>>>> the way we store the initiator-to-generic-target table data and how
>>>>>> we calculate and set up access0 vs access1. Maybe that can be done
>>>>>> as a later iteration?
>>>>>
>>>>> I'm not that bothered yet about CXL initiators - the issue today is
>>>>> ones on a different node on the host side of the root ports.
>>>>>
>>>>> For giggles, the NVIDIA Grace proposals for how they manage their GPU
>>>>> partitioning will create a bunch of GI nodes that may well be nearer
>>>>> to the CXL ports - I've no idea!
>>>>> https://lore.kernel.org/qemu-devel/20231225045603.7654-1-ankita@nvidia.com/
>>>>
>>>> It seems sad that we, as an industry, went through all this trouble to
>>>> define an enumerable CXL bus only to fall back to ACPI for enumeration.
>>>
>>> History - a lot of this stuff was in design before CXL surfaced. I
>>> think we are all pushing for people to reuse the CXL-defined
>>> infrastructure (or similar) in the long term to make everything
>>> enumerable.
>>>
>>> Arguably, for a host system ACPI is the enumeration method...
>>>
>>>> The Linux reaction to CFMWS takes a "Linux likely needs *at least*
>>>> this many memory target nodes considered at the beginning of time"
>>>> stance, with a "circle back to the dynamic node creation problem later
>>>> if it proves to be insufficient". The NVIDIA proposal appears to be
>>>> crossing that threshold, and I will go invite them to do the work to
>>>> dynamically enumerate initiators into the Linux tracking structures.
>>> Absolutely - various replies in earlier threads made that point (and
>>> that everyone has been kicking that tire down the road for years).
>>>
>>>> As for where this leaves this patchset, it is clear from this
>>>> conversation that v6.9 is a better target for clarifying this NUMA
>>>> information, but I think it is ok to move ahead with the base CDAT
>>>> parsing for v6.8 (the bits that are already exposed to linux-next).
>>>> Any objections?
>>>
>>> Should be fine if we keep away from the userspace-exposed new bits
>>> (though I think we can clarify them fairly fast - it's a bit late ;(
>>
>> The only exposure to user space is the QTG ID (qos_class) based on
>> access0 generic target numbers.
>
> That's a fun corner case. Which one is appropriate? My gut feeling is access1

I can fix that in my next rev of the HMEM_REPORTING series if your guidance
is to go with access1. It should be a one-line change within the new changes
coming that address access0 and access1 for CXL.

DJ

> but if the GI is in the host, then the QOS may want to reflect that. If the
> GI is on the CXL bus then whether host QOS matters is dependent on whether
> P2P is going on and where the bottleneck that is driving the QOS decisions
> in the host is.
>
> Let us cross our fingers that this never matters in practice. I can't face
> trying to get a clarification on this in appropriate standards groups given
> it's all a bit hypothetical for QOS (hence QTG selection) at least.
>
> Jonathan