From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <12112823-f4b7-4854-a32c-c40985c65521@amd.com>
Date: Mon, 30 Mar 2026 10:16:19 +0530
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot
References: <20260323095104.238982-1-bharata@amd.com> <20260323095104.238982-6-bharata@amd.com>
Content-Language: en-US
From: Bharata B Rao
In-Reply-To: <20260323095104.238982-6-bharata@amd.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Replies to the review from Sashiko [https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]

>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b298cba853ab..fe957ff91df9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>[ ... ]
>> @@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  		goto out_map;
>>
>>  	nid = folio_nid(folio);
>> +	needs_promotion = folio_is_promo_candidate(folio);
>>
>>  	target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
>>  					&last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;
>
> Does this code inadvertently record pages forbidden from migrating in pghot?
> If numa_migrate_check() returns NUMA_NO_NODE due to memory policy, the code
> jumps to out_map. At that point, nid retains its initialized value of
> folio_nid(folio).
> Since nid != NUMA_NO_NODE evaluates to true, it looks like
> pghot_record_access() will still be called at the end of the function, which
> could pollute the tracking map and trigger useless daemon sweeps.

While for NUMAB1 we need to call task_numa_fault() just to record the
fault stats, the equivalent of that isn't necessary for NUMAB2. I will
make the call to pghot_record_access() conditional for this scenario.

>> +
>> +	if (needs_promotion) {
>> +		/*
>> +		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> +		 * Isolation and migration are handled by pghot.
>> +		 *
>> +		 * TODO: mode2 check
>> +		 */
>> +		writable = false;
>
> Can this cause a regression involving spurious write-protect faults? By forcing
> writable = false here before jumping to out_map, the write permission appears
> to be stripped from the restored mapping even if the page was previously
> writable. If the application access was a write, it might immediately suffer a
> spurious write-protect fault when it retries the instruction.

>> +		nid = target_nid;
>> +		goto out_map;
>> +	}

Since there is no migration from the fault context for NUMAB2, we don't
release the PTL at all. Hence I realize that I don't have to modify
writable and ignore_writable at all for the hot page promotion case.

>[ ... ]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2f815a34d924..289fa6c07a42 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>[ ... ]
>> @@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  		goto out_map;
>>
>>  	nid = folio_nid(folio);
>> +	needs_promotion = folio_is_promo_candidate(folio);
>>  	nr_pages = folio_nr_pages(folio);
>>
>>  	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>>  					writable, &last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;
>
> Similarly to do_huge_pmd_numa_page(), could this also inadvertently record
> pages forbidden from migrating when jumping to out_map with an unmodified
> nid?

Same as above.
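To be concrete, the conditional I have in mind is along these lines (a
userspace sketch with a hypothetical helper name, not the actual patch):
feed the access into pghot only when numa_migrate_check() actually picked a
target node, so folios rejected by memory policy stay out of the tracking
map.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)

/*
 * Hypothetical helper (sketch only): numa_migrate_check() returns
 * NUMA_NO_NODE when the folio must not be migrated, e.g. due to memory
 * policy. Skipping pghot_record_access() in that case keeps rejected
 * folios from polluting the tracking map and triggering useless daemon
 * sweeps.
 */
static bool pghot_should_record(int target_nid)
{
	return target_nid != NUMA_NO_NODE;
}
```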
>> -	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +
>> +	if (needs_promotion) {
>> +		/*
>> +		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> +		 * Isolation and migration are handled by pghot.
>> +		 */
>> +		writable = false;
>> +		ignore_writable = true;
>
> Would forcing writable = false and ignore_writable = true here also strip
> permissions and cause a performance regression from spurious write-protect
> faults for previously writable pages?

Same as above.

>> +		nid = target_nid;
>> +		goto out_map;
>> +	}
>> +
>> +	/* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
>> +	if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
>>  		flags |= TNF_MIGRATE_FAIL;
>>  		goto out_map;
>>  	}

>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a5f48984ed3e..db6832b4b95b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>[ ... ]
>> @@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>>  #ifdef CONFIG_NUMA_BALANCING
>>  		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>>  		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
>> +#endif
>> +#ifdef CONFIG_PGHOT
>> +		mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
>>  #endif
>>  	}
>
> Should this use mod_lruvec_state() instead of mod_node_page_state() to
> accurately update both the node and memcg counters? PGPROMOTE_SUCCESS is
> defined in memcg_node_stat_items, so it is exported to the cgroup
> memory.stat interface. Using mod_node_page_state() bypasses memcg accounting,
> which might cause the cgroup-level promotion statistics to silently underflow
> or remain at zero.

Replied to this in one of the previous threads.

>> diff --git a/mm/pghot.c b/mm/pghot.c
>> index 7d7ef0800ae2..3c0ba254ad4c 100644
>> --- a/mm/pghot.c
>> +++ b/mm/pghot.c
[ ... ]
>> @@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>>  	return 0;
>>  }
>[ ... ]
>> +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
>> +					    unsigned long time)
>> +{
>> +	struct pglist_data *pgdat;
>> +	unsigned long rate_limit;
>> +	unsigned int th, def_th;
>> +	unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
>> +	unsigned long now = jiffies;
>
> Could this cause spurious threshold adjustments during millisecond
> wrap-arounds? jiffies_to_msecs() returns a 32-bit unsigned int which is
> zero-extended into the 64-bit unsigned long now_ms. When computing time
> differences like now_ms - start (where start is an unsigned int) in
> kmigrated_promotion_adjust_threshold(), the subtraction is performed in
> 64-bit arithmetic. When the 32-bit counter wraps around, the subtraction
> could yield a very large value that falsely satisfies conditions like
> > MSEC_PER_SEC or > th_period, which could cause a regression in
> rate-limiting.

It could wrap around, but since cmpxchg updates pgdat->nbp_rl_start to the
post-wrap value, it should be self-correcting. However, I can change now_ms
to unsigned int to make this exact.

>> +
>> +	pgdat = NODE_DATA(nid);
>> +	if (pgdat_free_space_enough(pgdat)) {
>[ ... ]
>> +
>> +	return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
>> +}
[ ... ]
>> @@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
>>  			goto out_next;
>>  		}
>>
>> +		if (!kmigrated_should_migrate_memory(nr, nid, time)) {
>> +			folio_put(folio);
>> +			goto out_next;
>> +		}
>> +
>
> Does this correctly advance the PFN when encountering large folio tail pages?
> Looking at the rest of kmigrated_walk_zone(), the loop iterator pfn is
> advanced by nr = folio_nr_pages(folio) at the out_next label.
> If the loop lands on a tail page of a large folio (for example, if a
> previous iteration failed a check and incremented by 1), folio_nr_pages()
> returns the size of the entire large folio.
> Adding the full folio size to a tail page's PFN overshoots the end of the
> folio, potentially skipping valid pages of subsequent allocations.
> Would it be safer to advance by
> folio_nr_pages(folio) - folio_page_idx(folio, page)?

We could end up on tail pages, leading to some folios being skipped, but I
think they will be reached in the next pass. Anyway, I will check if your
suggestion can be incorporated without any additional overhead.

Regards,
Bharata.
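P.S. A toy userspace model of the PFN arithmetic in question (made-up
numbers, not the kernel walk itself): advancing by the full folio size from
a tail page overshoots the folio boundary, while subtracting the page index
within the folio, as suggested, lands exactly on the first PFN past the
folio.

```c
#include <assert.h>

/*
 * Toy model: a folio of 'folio_nr' pages starts at 'folio_start_pfn'
 * and the walk has landed on (possibly tail) page 'pfn'.
 */
static unsigned long next_pfn_naive(unsigned long pfn, unsigned long folio_nr)
{
	return pfn + folio_nr;			/* overshoots from a tail page */
}

static unsigned long next_pfn_fixed(unsigned long pfn,
				    unsigned long folio_start_pfn,
				    unsigned long folio_nr)
{
	/* stand-in for folio_page_idx(folio, page) */
	unsigned long idx = pfn - folio_start_pfn;

	return pfn + (folio_nr - idx);		/* first PFN past the folio */
}
```

With an order-9 folio spanning PFNs 512..1023 and the walk landing on tail
PFN 700, the naive advance jumps to 1212 (past the next folio's head at
1024), while the fixed advance lands exactly on 1024.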