From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 62D61CA0EED for ; Thu, 28 Aug 2025 10:51:03 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AFA076B00A3; Thu, 28 Aug 2025 06:51:02 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AAA896B00A4; Thu, 28 Aug 2025 06:51:02 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8FDEB6B00A5; Thu, 28 Aug 2025 06:51:02 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 76BEB6B00A3 for ; Thu, 28 Aug 2025 06:51:02 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 21C2F85492 for ; Thu, 28 Aug 2025 10:51:02 +0000 (UTC) X-FDA: 83825848764.17.524A52C Received: from mx0b-00069f02.pphosted.com (mx0b-00069f02.pphosted.com [205.220.177.32]) by imf14.hostedemail.com (Postfix) with ESMTP id 9813C100009 for ; Thu, 28 Aug 2025 10:50:58 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=oracle.com header.s=corp-2025-04-25 header.b=cG6kqENa; dkim=pass header.d=oracle.onmicrosoft.com header.s=selector2-oracle-onmicrosoft-com header.b=yqKIEV51; dmarc=pass (policy=reject) header.from=oracle.com; arc=pass ("microsoft.com:s=arcselector10001:i=1"); spf=pass (imf14.hostedemail.com: domain of lorenzo.stoakes@oracle.com designates 205.220.177.32 as permitted sender) smtp.mailfrom=lorenzo.stoakes@oracle.com ARC-Seal: i=2; s=arc-20220608; d=hostedemail.com; t=1756378258; a=rsa-sha256; cv=pass; b=II+eohSpi8oR03E/jonAhVGZqTle0E8pYzG8Pdc751Ovzyy50PoxcRJSOVLXKU0+xeOdDQ KvJ3ytRAWrxyV2b0AbFD6FAUDzOONi5P4k+QrpfPBgcaMsxPcEZ/VGC3tQK/abx7V+TqlJ yYWbeWAHhsQYiaDf7D7qq9wi13s0rBk= ARC-Authentication-Results: i=2; imf14.hostedemail.com; dkim=pass header.d=oracle.com header.s=corp-2025-04-25 header.b=cG6kqENa; dkim=pass header.d=oracle.onmicrosoft.com header.s=selector2-oracle-onmicrosoft-com header.b=yqKIEV51; dmarc=pass (policy=reject) header.from=oracle.com; arc=pass ("microsoft.com:s=arcselector10001:i=1"); spf=pass (imf14.hostedemail.com: domain of lorenzo.stoakes@oracle.com designates 205.220.177.32 as permitted sender) smtp.mailfrom=lorenzo.stoakes@oracle.com ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1756378258; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=X9QYOPadtEcK/RZ0qJyl+YgGbHqy64H6Xe8LJ3LSb8M=; b=TSHK3xpd/RH+i18FZ6MQcA6lZhggDlJ0fQgmCSRHuxKnr0BjC/qprLARDUFkiOlBFWRhf4 Ok0aD6zbw/6NInr8Mdv9Z8QjbtD73IXutYSXcMWEHrcXtnddI8aG3vuv4+jhktnsSBlMPf UznbSbsx6bFD6SLk5/qNlN6ymoyijg4= Received: from pps.filterd (m0246632.ppops.net [127.0.0.1]) by mx0b-00069f02.pphosted.com (8.18.1.2/8.18.1.2) with ESMTP id 57S8tmHP029314; Thu, 28 Aug 2025 10:50:28 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s= corp-2025-04-25; bh=X9QYOPadtEcK/RZ0qJyl+YgGbHqy64H6Xe8LJ3LSb8M=; b= cG6kqENaeTifFEkssNakVYgdlNxLslUmLdw9cHq6zSpSTGfI1gNCtnfMIXLhRM0r zytNFJlrkNdovoEUg8/1n02gmQt/pzFyBLcgTgukvwOUcOrVjplPO8hglNRHe6Vp c4Enhd0EBEROvFW2ZhM5vlz++nRLb7AP18HmfDBwhyHcy4bRqRCiaixN+AxPG2+M xWsSOtrc9dmBRVn104OLFAuV9OkvyqO5VNqEZYdjnxkRVJowZAtUNiZWK6+fAOTi IFb7bdQUW0luxlv1P2JZuXhZ5LY+bpqTRD0PsR1sPvlJS1tCdNkkulwrMdYgonBf fuA7xbfeFlrzZIpRQB5ajg== Received: from phxpaimrmta02.imrmtpd1.prodappphxaev1.oraclevcn.com (phxpaimrmta02.appoci.oracle.com [147.154.114.232]) by mx0b-00069f02.pphosted.com (PPS) with ESMTPS id 48q58s87q2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 28 Aug 2025 10:50:27 +0000 (GMT) Received: from pps.filterd (phxpaimrmta02.imrmtpd1.prodappphxaev1.oraclevcn.com [127.0.0.1]) by phxpaimrmta02.imrmtpd1.prodappphxaev1.oraclevcn.com (8.18.1.2/8.18.1.2) with ESMTP id 57S9o1jM027083; Thu, 28 Aug 2025 10:50:26 GMT Received: from ph0pr06cu001.outbound.protection.outlook.com (mail-westus3azon11011014.outbound.protection.outlook.com [40.107.208.14]) by phxpaimrmta02.imrmtpd1.prodappphxaev1.oraclevcn.com (PPS) with ESMTPS id 48q43bm0ph-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 28 Aug 2025 10:50:26 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=CAr1ydTW299GZz05HxXBrLOddklIgyqtRW7bJsU6nzDVg7knvE01y8brRyR3modDZTFK1qaS2ZWI1EW+Oh4l4bs9sZBUH35KhMe3qSgxJomHVlspy8eeVeqBBWqaiJ4SDE3s4AGiTGsakHEuHqgGClJBROvx1610R4O6luuJjsdqRTIvwSkXxlrtHqRFI6RfMl0zx32361w5su2ygplmErj642LHoO8PunUS75rZeNwSbZR7mzlcsLwTh+iAj2lfxgr7ZPUIsRHXgRKH3eyiZ8I7H7xJcWhUzgS9++WT+sydzgWBIVgGcpmk32ajUd6LEAr7CqPX/CPXxQA8JrUKbQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=X9QYOPadtEcK/RZ0qJyl+YgGbHqy64H6Xe8LJ3LSb8M=; b=TK1LVXdhdg/ALWR2aQWxRUXCMXap2WWksS/0Nt+ifAdyVAAHxWDzbXewVqHq6SgOTXOxlq3+Rv9wLKSgUnIS8x2rcd45QaqDn9vnPEyWrEGry53DlMKYIqDxnZyHxScgsmwsjceOjEJoiJcNpeTsh2eknfjII3KtDCed4xZkzKu6jqaEyHAso6h1QotJW2ZyZVgiGE+xcGGmVvEwoSlGUa7jm+x2qyNfHaWHew7h5rMlqixUfn9iwNYJP9xNuTTMTRJxHyDKKClsSxccy3441Geelr9AOBMlkiajWAm5t5l7TJzURpDmnd25n5rqTrVG9HOfeMoUHN6ypidqAAYwug== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=oracle.com; dmarc=pass action=none header.from=oracle.com; dkim=pass header.d=oracle.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.onmicrosoft.com; s=selector2-oracle-onmicrosoft-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=X9QYOPadtEcK/RZ0qJyl+YgGbHqy64H6Xe8LJ3LSb8M=; b=yqKIEV51tjdRsQBLU76KI0O/kLfeKA+hqjufYNzjQcxOwD4fsAaWOS4TEddGwHSrAvJ2QN4XgooLXD+LQfJP0M2B3SEDfwr7ih2nEsJGtDXANIExiG5ch1K7amuwaLV7j9d+CJoL15pD5w93MJvnzo7Up77glekPLHgv1EJrw6M= Received: from DM4PR10MB8218.namprd10.prod.outlook.com (2603:10b6:8:1cc::16) by IA0PR10MB7136.namprd10.prod.outlook.com (2603:10b6:208:409::9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9052.20; Thu, 28 Aug 2025 10:50:19 +0000 Received: from DM4PR10MB8218.namprd10.prod.outlook.com ([fe80::2650:55cf:2816:5f2]) by DM4PR10MB8218.namprd10.prod.outlook.com ([fe80::2650:55cf:2816:5f2%5]) with mapi id 15.20.9052.019; Thu, 28 Aug 2025 10:50:19 +0000 Date: Thu, 28 Aug 2025 11:50:16 +0100 From: Lorenzo Stoakes To: Yafang Shao Cc: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net, bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org Subject: Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection Message-ID: <80db932c-6d0d-43ef-9c80-386300cbeb64@lucifer.local> References: <20250826071948.2618-1-laoar.shao@gmail.com> <20250826071948.2618-2-laoar.shao@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-ClientProxiedBy: CWLP123CA0006.GBRP123.PROD.OUTLOOK.COM (2603:10a6:401:56::18) To DM4PR10MB8218.namprd10.prod.outlook.com (2603:10b6:8:1cc::16) MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DM4PR10MB8218:EE_|IA0PR10MB7136:EE_ X-MS-Office365-Filtering-Correlation-Id: 2272602a-303a-4cdb-0f80-08dde620adff X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|7416014|376014|366016|1800799024|7053199007; X-Microsoft-Antispam-Message-Info: =?utf-8?B?TkZhb051bXBYclFLQUkrRjlFMVUzenVBUTljZG5lMzZ3R0dnVi9zTmp3LzJ6?= =?utf-8?B?OVVuSDFzVENLYVJWNGRteUxkM2RmWVhkS0FMVXRvUVlaREdObFlqQmJxcGZl?= =?utf-8?B?eGt4ZFNUckxmL285aEYzODJXRXRKSm94M0R1b2ZoVklQWVdpdXV2MXg2Vkh5?= =?utf-8?B?dld2M1h0bzNkVDRLOGExODdHbUtSbFJJMUdueVhjSWVlUlNmTnpjN1VFd1U2?= =?utf-8?B?NStYZ1dhTWFSQ0lscVdKdHA3ZnZZcGx3cklNUEEybEs0eG5waDVqOG1lTStK?= =?utf-8?B?T2ZzUmtKZlZnV1dBNktQQmQycjFCbDZBNGtpc0xybmNWRVR3OWZSb3c0QW1w?= =?utf-8?B?cjJDL1JLalIyUjR2cnJFcmFqVEFzQXRRRmorTFVhWTZRRTFIdDBiQm01MnJ2?= =?utf-8?B?YjhHa25uUGIxbTlzY1dEa2s1UTJXNjZUeGVZbURBeXF0WWkrMFVpemRlZVVQ?= =?utf-8?B?NXhVejk2cFJVdUI5bGw0OVRuNFNBdW45WEZBSjRsb1RsRzc5ODdNWnpVOUdZ?= =?utf-8?B?RkFtc3REd3VOd2t4cXlhQWJYeTFVV3YwcDhuYVRnaVRzR2tjTTcwSjFLUjNT?= =?utf-8?B?bElJYTVUc1lqclR1YitVZnBFcVpRcDJ3Q2dBUUdVTVBWaWt2aXRuOE84Zmsw?= =?utf-8?B?N1QwcEViUUF5QWVMUUQzTGdUUEM1OE5QaitaelpFZWQrRkRnMU1kLzVIZnpz?= =?utf-8?B?OUwzZ1NhM2Y1SWlLOG81dUo4K0dmd3ZzS2VsdGhhYmlOaXI3ejU5K0hUak5t?= =?utf-8?B?VnNscVZGZzd4dlJmeGJxMTZKMEl4VEdNWXFGMVF1ZGhWeU1qQXFieks2MjRN?= =?utf-8?B?cThncHVsaG1LS0ZJaStBbmIyUmxRWkRpZlpxR2ZhNGF6WDZkdTJCeUsvamUw?= =?utf-8?B?dC93ekV6L0dBeXA3OVQ1M3RlQmY1d29vWWU3ZUtReWUxNm90ajFsVmNjbmkx?= =?utf-8?B?alg3Qmk1RnBkNU14ZEFaVGdZOW9oYmhHZ0o2RVdlcWhvNzBnTXIrR0NPMjU5?= =?utf-8?B?MkZPRit3Y0dNM1I3aDlqZ04wK0YwMzBVTWhqU2tkbFgwelpCMEYrL2I4c05v?= =?utf-8?B?T1RlalFaeVMxaU5FU0tBbGlDb1lsODI3ZDRNRkd5YnQvMnF4dzc4aldncDdK?= =?utf-8?B?bE1aSUtQbVdtdFB0UUNZR0dSbGpwaGtsVTkyM1VUMXRZamdNQmRsUzAzYzdm?= =?utf-8?B?Yzk2UWNudDBINHBTdjQ4V290SGhXMThsSVBLMWlDeWw1Qm02a2g5eFJQdnND?= =?utf-8?B?VVhyN2MrWGFlSUdhcFNOU0c2Titjb0cxV1VWUFc4SUttNWpWSlNySlVZL2tY?= =?utf-8?B?VDFXWXdFd3pueGJhS1ZVMTFPeHhVU3VJQnZYQ1JlMlF3V0NlemgwbVU5ejNy?= =?utf-8?B?UCtVMm44WHlMNmpJMmtXYURBMlhyUWFteUZuZEtURHM4NG9zcDhCQ0duK1Vk?= =?utf-8?B?b3dZeXNYOHpvSmluRXBKWGZIZG5udnJNM000VWhxYm9hWU0yNUFSL0l3MS9s?= =?utf-8?B?RXFiWGZHN1hyOTFaWWh2QVhDL2JrR1FrQmllOXg3MHM2bUp2QU9ZRk5CTWVm?= =?utf-8?B?aGJva1FteTd4MVJJNDd5aXFUT0lCeFNhcnQyaHZYekJ3cVRyMUJnZFF5ZWE3?= =?utf-8?B?NmNPcFA4Mng3Z25VWjdOM3lSN01ncDZtNzdoeGJpVXRpdnhEcW1ObTBlcDVW?= =?utf-8?B?bjFNelpYalJiNlZ6c3Q3a3lCRmtsdGVXOUpqV1ZKWnZFcmpYejFkdnFzK1R0?= =?utf-8?B?TEJ6TG8yMEtqc1h0aTBJYys4di9XekZWZjAzRFU2Ym5GcTlZUFRIQmNxTjlk?= =?utf-8?Q?hCbRYIZgpaaifWb3VjSHwAOeApkAQZya6gaRE=3D?= X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:DM4PR10MB8218.namprd10.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230040)(7416014)(376014)(366016)(1800799024)(7053199007);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?NjdZSStLRFprM0dJQjBWOXo1YVdoUExGN2pJbGxma0VVdHhJaTV3Y2o3azVW?= =?utf-8?B?YnpDb3N4aU55bHdYTVBMaE9xN1IwVm5vS0ExZGhGQ0p6Yk4vMDgvVnd2MlR3?= =?utf-8?B?WHlyVTVhQkdFbTZDZHZudG9Ud1E1R3Nmb3kwZkdUL25mY085T3dwRUVabXFD?= =?utf-8?B?NWdyakx3RXFra09PL1I5MWJVbTNGTkVoYnoyOHJzVDVraTMvNU5nV2MvWnVV?= =?utf-8?B?TGxIV1JOOXRCUmpRNExhVmNTOElxSllQdTJUeHRpNWkyUG9taSttVlhYd1Jo?= =?utf-8?B?QnFCRjRNa0o5QlVaRXZ6STF6SlFrNlFJL0lsMkZXWEtLaEMwNENOS0FHOHBs?= =?utf-8?B?UFVmMW1raHRwZ1hYdUhGOXdMeldVZGpqZE1ha2doR0hVM240UVl1dGZ4OHlw?= =?utf-8?B?NWNVRHFOWlUwNmgyVHVhd0p1RjhFMXZ6Yi9Fc3pmWmcwTjFLa3Y0YnNaSTdy?= =?utf-8?B?K3NTSHhxeTRoeEpYZEYySUQ2K081NFFqV1lHVE1WVyt3T0hVbzN6MHBtR1Qz?= =?utf-8?B?OFNJSHlicEFpbXhZUXJhNnJrWC9sTk56VnBYTE1tbUM4SWpJUnVHa3hzSndo?= =?utf-8?B?eVU0M1BLYWFaTXN3WjdtQWRPMktFaGRVVHNHZmNGcklHWTZGZHFpYlhjRUFN?= =?utf-8?B?U2U1RjFoZ2VtZTFpaTZ6VUdBcFgvZzFmdFZxNzEwTlIwanM4VXk5WW1iaXNY?= =?utf-8?B?cDBaZVhjYzY1U3hlK0JUdWY0cE5nNjVwdlVFcU10NHhWUVFkWGRnek1RbTlE?= =?utf-8?B?NTgzN2hBeFowcEpWdWxidUVZWExyZlJ1SUdyWXVjQVc0WHJVMUFvQ3hyTE5y?= =?utf-8?B?UkxrbkVLQ2t2bnYwWkNPeDZxRlpZNUhHbVZmaVBXa3NPblFYdFRGWDRPazBy?= =?utf-8?B?cnltOE5md01XY2VwcHBidGFFVklQQzVpVHZQUG1HbVh0a0N3MXozbmRyYzRv?= =?utf-8?B?eWNmSWNvNlFqbG05Z3dhbDNHWkRFTlk4dGRmVnpzZit3cXBlcnVmajN5dERO?= =?utf-8?B?am1xTi9VenA2VmlPNndtVS9sUXZuTkhveFlVVE9remRidHFHdmUvU1pLbHRs?= =?utf-8?B?cXAvY0dUck9TWGpyeThNclBpVmJUUFhqUkVDWTZleGRWVjF1SVdCaUUrVXdy?= =?utf-8?B?blVGY2N6ZHF6T2s0YU1MdkxpbEJseWp5UUc5OWVHMzRUM1MyVm5laGRqMkZU?= =?utf-8?B?cWtYYWFNUDk0N3REVzcwdmFjcjJVWC85ZnMwczdpNE91a0xvMHpEaDVjZFBW?= =?utf-8?B?dDNPUDJ4TzJIVFlKSWMyMWZpekdsWGtLL0w5NTRkdVdVakZ4SnNFRWFNVTlh?= =?utf-8?B?K3kxUUoyakIvMjQ2MWw5Tk5ISzQ4blNkM1FoWGZZZkUzVmFRcG84UFd3V2I5?= =?utf-8?B?L29LVVZia3llR0xFOHBkRDlBblZsNkdXdjUxbUtIOHdMTmI5RXB1SVR6dVpW?= =?utf-8?B?b0tQU1BxQmc4SnNvWmJUMUsvbXRiSmdlaUtsM0E1R1Qvc3B2TktvaCt5MWll?= =?utf-8?B?RU84WlBQbUduQzNVa1FodGUwRHU5ckkwR29FTExMQ0JFZFlKOVdKL3FRZkRR?= =?utf-8?B?YVBjZU1hZ2Z6QkZ5aWs2L3RNQUJ2M0E4bWw3S0w3NFhLcDlYN1FVOUFyTTVp?= =?utf-8?B?WGtQdWxUTXBBTXIyd2NkUHNrb0VtZTU5WlNuRHJzNXJ3Y0pubVZWZElqSExm?= =?utf-8?B?dkVONUZnQ3BwYmlkZWx6OW1PMVFQS08xbWNOR1hqakdja3g1R21hTEcwSStE?= =?utf-8?B?TDFZeEExT1Qvb0taQmt1SzRmdERkRzRudU1aT0g3TUROT0c5a3JBaWVPZkVq?= =?utf-8?B?WCtZczNvM0ZLQU1kWlRxZW1xaiszNTkzckU5SG1nZ3R6dDV2aVJIRDdvbS9n?= =?utf-8?B?MFVuYXVSaldlbzRGZHBYOGJPWTZCT0kyd2VTTTdqeTA1aXoweUxRR09DdCs4?= =?utf-8?B?ZndsVEtJTzFueVIxcGZxK0d6aWE3RHYrZUd1clNqME5MTTlCWDVjRWpuZXpC?= =?utf-8?B?WjBMVjdCT0doN2NvY0hTWGV4MHRvZDcrOVN3SjVxTWF0NjZOaGYyd1ZlbUVI?= =?utf-8?B?OEdkZ2RxeHI1ZGczUlVYNmIrazJNU2pzb0lkbnMvY2lsZlkvbFFlWnhHeERO?= =?utf-8?B?WEJzRmtzUzRxVE9lRThHcWpIV2FJNHg1Q2pndUkxeWNvR0NCQjFXRXRBSXNk?= =?utf-8?B?eGc9PQ==?= X-MS-Exchange-AntiSpam-ExternalHop-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-ExternalHop-MessageData-0: F2B6KayVZhjKGt5j7QKFm1XJ7V15aFiE/qxSFmRmCncsUyGhyUgAnlG8Kf81wq8wr7nJVBlch/PBrGNlsptGxE/88CZbDR5c0GbCg/dF+zNsm76+ZfssmqMiu9wJUEDf1VvkFcEMZGKncE4NuAhX2gQt07ZL9tH+ddGz5ZI7zbknBepURthZ7A0e7u6LQQZ5Yz2YeIxRtiPJpJECtoRCq+Jz0yRXYdrRPF77WXQjx6evN9cAcOzGODZLIKFk51I1joBDHdCJpwhAJj2tSN1qIyIsRTDnxNEpLvw0wTZ/j2yuhnrPkz3zSHMgsg5+aAV9VMWpm7G4VFAXAp3OCNrxeFByaWThqhHICavJfwAfVuvxO3b+m49TMq2zTUkBc+iVyWFtiqWisFFO0OUnPjM1GBFOQEUCgRvJh3Kr4D6RPWfKkBUBl4DHy4toavt1L2xwV9WPdrwCE8eGJ8g5xAiE3g2cpwIDDbScnDb3rdO8zAk+rM2mk2d/49tm0nanmb2vbDrPBcU7J+GLwMVWNfq0CeUaXuLZR3v4j7AOJ47aHvQIy2mNLAWidT837u0uPCZxcImPkQNS3Bxxz5Puaa4nLVs2Idqec9SWxfvl147R63Y= X-OriginatorOrg: oracle.com X-MS-Exchange-CrossTenant-Network-Message-Id: 2272602a-303a-4cdb-0f80-08dde620adff X-MS-Exchange-CrossTenant-AuthSource: DM4PR10MB8218.namprd10.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 28 Aug 2025 10:50:19.1148 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 4e2c6054-71cb-48f1-bd6c-3a9705aca71b X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: U4CVEqjg6GMHpsIFnj/yqUHiuh/EdoaXvS/+qanWWj3miEmVf+sNQgt+IjWff+ViZyFK7J/XIziRk+7IbKBoUzJ+bN1S6xH5XMlEvcrfcwA= X-MS-Exchange-Transport-CrossTenantHeadersStamped: IA0PR10MB7136 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1099,Hydra:6.1.9,FMLib:17.12.80.40 definitions=2025-08-28_03,2025-08-28_01,2025-03-28_01 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 bulkscore=0 malwarescore=0 adultscore=0 phishscore=0 suspectscore=0 mlxlogscore=999 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2508110000 definitions=main-2508280090 X-Authority-Analysis: v=2.4 cv=J6mq7BnS c=1 sm=1 tr=0 ts=68b03474 cx=c_pps a=OOZaFjgC48PWsiFpTAqLcw==:117 a=OOZaFjgC48PWsiFpTAqLcw==:17 a=6eWqkTHjU83fiwn7nKZWdM+Sl24=:19 a=z/mQ4Ysz8XfWz/Q5cLBRGdckG28=:19 a=lCpzRmAYbLLaTzLvsPZ7Mbvzbb8=:19 a=wKuvFiaSGQ0qltdbU6+NXLB8nM8=:19 a=Ol13hO9ccFRV9qXi2t6ftBPywas=:19 a=xqWC_Br6kY4A:10 a=IkcTkHD0fZMA:10 a=2OwXVqhp2XgA:10 a=GoEa3M9JfhUA:10 a=07d9gI8wAAAA:8 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=pGLkceISAAAA:8 a=EUKqZ3xtgp5BXCl0uf8A:9 a=3ZKOabzyN94A:10 a=QEXdDO2ut3YA:10 a=e2CUPOnPG4QKp8I52DXD:22 X-Proofpoint-GUID: 2grRuPNC379gAs3s3UWVD8I1ExjhsYn8 X-Proofpoint-ORIG-GUID: 2grRuPNC379gAs3s3UWVD8I1ExjhsYn8 X-Proofpoint-Spam-Details-Enc: AW1haW4tMjUwODIzMDAyNyBTYWx0ZWRfXwXttoAEAn3MV Xc6f3zSafgMOal4jTuxXCr8asoG/okPnhpV8FTz5cfMVAcXMt9VcW/ATUJn3AcqAaNuUuekMro6 AB8KInJbWGRytpd8n/MKEi+f8nYUJ+YMurgNlFreU/kAUbLsrTHpiCk83yesAq95EJUpNmbzJLG uBIF1oWpu7AqXJmGfXxYZIJDdO5tYJjZCNiTYT635GRRGUhkXtjtD82imR0/RjxrHtb6R9BpsBU R/2Z28KXrTMQIIpiiHwWYwzWZfFVUCsFBy69gLAN5nZyYyptoZiZe9V/nMsMIbdUr8P26LjcYxO ORm3Mau7KnXvLkJTmsacBXe3T/YUmyIKnr9g/O8R0QwPKM0xiF7UvQzSI4WHdM4so8vFnlHOyB3 Lvs51krP X-Rspamd-Server: rspam03 X-Rspam-User: X-Rspamd-Queue-Id: 9813C100009 X-Stat-Signature: a5p3mhpgbowsiqbynrxdifx1mwojgptx X-HE-Tag: 1756378258-977774 X-HE-Meta: U2FsdGVkX1/QovnMboNetmDIAiWX6KyBsSZjHpBHA+ZUqrIkgZUz5h8553ifKsnxKXersupZ2E97jejxPv73ndevNSMTLrd6jUikbfjAZ26vE30JW7gA4fCA4drs7F9anm5/V0SWidMyy3TkxZAVM/TWxzCGkHicjGzCyO6Wnnvf7CI35RmSAMczPQ20UgAhHh/L1/uogHZpJHPTF2C8ra5ybZuPT1gmSAbP9qX7cGmvVzqXlw0vwZ2vZhnTd48UfLL0uPLQnEWcifhGoYHZtHaf6tboftHl/2a/IAxEOF3vcdQoEjtcDsMet306zfInRMhyWnh2V/SVZ4S3ynG+pYYimvF3TRz4iZ1tVUKrqu4KZbjYrxcOTaFy2OwefvDF6pFI3WHba/+UykBef7Vo9nBSJKzdhLxOgSBFLm5fRkoGGLFU1F47rhOFnlPCy9pAqZHt54ImkhfJR1B0hfvonJCTYUfzwbvNfh/e2BAiKOhA7DURGntkQ60jlcCmRjsWbHeYkp7mGX14Ou1o8OMR7+d4YR5qd5Ph6Sc0qVmRkDb0a57E7wJgG81J/eexDR3cSWLFzfzy829+xP7J2R6V3+22kObXLm6DZO963smIgcau851mvLJi3WLCbysljusIhryuRuNHJTHKeq2J5eiqqR+q5XgilMjXGry1Uq2j9TrRt4UIieP8SjXBPSDvh9YU0VkQKZ01tTDk9Xgxq2wgbjVzQCsuCJQtZXPJseNkDbZoLSFOtVze5qPj3IPkXuxYL9shHopMeUsZ38dZlCWT4Tnu2Na+asY24JjZuntReXBHo3tUwoNKzj+5mqjnYNoBOF1foqYJbyLTRIybMPKxrADwM+fqCB+TxtKT7gVpsXu5DgAv+QqktOoi1x6BpukzOP7onU70DUa7XRI66cKlxExWwq/r4ilABwsg+ymq1oyNuaK1BbzCkmhag1S6TuZdqrfMs8xKvOnKDRiXllb 8p381QAb /jIorfe19F4Yh6WRHnYfntiftRhN+Rzj8yh9/tJIZjDRPL37Ws3E/50wQkkQttMIK2kaMxJvtPaYF+6DCrgDe5We+GOhRqVnVOm0GsY7HwQsyk97q+tVN0bH4rhQNrj7opN4hO25KBunZPPabxq593kepqIdAR8ozo0JIVWvBcXzxipSD/hW4CxnLg9Uur/jBWANHZto3VQW4G5ZTCAJI5mh8i6rtV/Zj0lT+vOpcpZ3uki1/5/h/hwTN6Ldcw3+/ioXc1h+ER0yunWbFMr+MzXHP1M4yD8EwJ1N2VX/0L6W3v29w2ZcaHL3YQ4EGntlH/N7WgqjZWpwpSYCJeDumzvhIEGvTbuKf6QN/CUU0Kd3JAgpcd9a3qEWo3ivOZccPa2G40gojC3frkPdmmgRi6Qdq2/AP1gcMw8wBhXcPl/wzDo06PBPGoIQPuO5iychV6nBC9RSN6bt1oRBicne74mpFdf3EPPKXTMAioyK2Dr7MAVin5RE14A7zJQWypkK4dIRzVhMDiAqrU2X8+q+/ec7CmMqApOWXljBbUW436CprgjM2EyHuO9b1oHgFxF6EPp0PwEocK8hDKqpohedIjBxhmq0kKEB3v6QxCFrydgZDtZgLdZILM8c7FogPiYK65/wLoUAlbaxAWwP1+2qPw/HywSBMW2wok8vOEULi/+zSsH7/YRWWH3UWx/0I011UmMD7bIXckmhYgi/TfLXmywijBkz/jkQLwEj64mOo9Z8YWUr2P3JGHZFyod8sxOlG89cwDQEDuZjER/xSjMVICnCIwsVs+moctnKbjYH8Dkh/OfcG5O7patyrNZrUVQhMK0qLtXvsjwDMrm77dL2RP7QobfK1ZxAZhOcDP6VKlVc6YfNsqoOB+akGp02pChZU/4pPAlOhkVFvlTpmQhLSjyn60SZ6Dd+f9rCQKahNRAIySFGU5xesC2/hKA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote: > On Wed, Aug 27, 2025 at 11:03 PM Lorenzo Stoakes > wrote: > > > > On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote: > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic > > > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF > > > programs to influence THP order selection based on factors such as: > > > - Workload identity > > > For example, workloads running in specific containers or cgroups. > > > - Allocation context > > > Whether the allocation occurs during a page fault, khugepaged, or other > > > paths. > > > - System memory pressure > > > (May require new BPF helpers to accurately assess memory pressure.) > > > > > > Key Details: > > > - Only one BPF program can be attached at a time, but it can be updated > > > dynamically to adjust the policy. > > > - Supports automatic mTHP order selection and per-workload THP policies. > > > - Only functional when THP is set to madise or always. > > > > > > It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1] > > > This feature is unstable and may evolve in future kernel versions. > > > > > > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0] > > > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1] > > > > > > Suggested-by: David Hildenbrand > > > Suggested-by: Lorenzo Stoakes > > > Signed-off-by: Yafang Shao > > > --- > > > include/linux/huge_mm.h | 15 +++ > > > include/linux/khugepaged.h | 12 ++- > > > mm/Kconfig | 12 +++ > > > mm/Makefile | 1 + > > > mm/bpf_thp.c | 186 +++++++++++++++++++++++++++++++++++++ > > > > Please add new files to MAINTAINERS as you add them. > > will do it. > > > > > > mm/huge_memory.c | 10 ++ > > > mm/khugepaged.c | 26 +++++- > > > mm/memory.c | 18 +++- > > > 8 files changed, 273 insertions(+), 7 deletions(-) > > > create mode 100644 mm/bpf_thp.c > > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > index 1ac0d06fb3c1..f0c91d7bd267 100644 > > > --- a/include/linux/huge_mm.h > > > +++ b/include/linux/huge_mm.h > > > @@ -6,6 +6,8 @@ > > > > > > #include /* only for vma_is_dax() */ > > > #include > > > +#include > > > +#include > > > > Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled by > > header defines but still. > > Some refactoring is needed for these two header files, but we can > handle it separately later. > > > > > > > > > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); > > > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > > @@ -56,6 +58,7 @@ enum transparent_hugepage_flag { > > > TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, > > > TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG, > > > TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG, > > > + TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */ > > > }; > > > > > > struct kobject; > > > @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void) > > > (1< > > } > > > > > > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION > > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders); > > > > Not a massive fan of this naming to be honest. I think it should explicitly > > reference bpf, e.g. bpf_hook_thp_get_order() or something. > > will change it to bpf_hook_thp_get_orders(). Thanks! > > > > > Right now this is super unclear as to what it's for. > > > > Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change > > to a bitmap of unlimiiteeed size soon. So probs best not to pass around as value > > type either. > > As replied in another thread. I will change it. Thanks. Will check the other thread. > > > > > But unclear us to purpose as mentioned elsewhere. > > > > And also get_suggested_order() should be get_suggested_orderS() no? As you > > seem later in the code to be referencing a bitfield? > > Right, it should be bpf_hook_thp_get_orderS(). Thanks! > > > > > Also will mm ever != vma->vm_mm? > > No it can't. It can be guaranteed by the caller. In this case we don't need to pass mm separately then right? > > > > > Are we hacking this for the sake of overloading what this does? > > The @vma is actually unneeded. I will remove it. Ah OK. I am still a little concerned about passing around a value reference to the VMA flags though, esp as this type can + will change in future (not sure what that means for BPF). We may go to e.g. a 128 bit bitmap there etc. > > > > > Also if we're returning a bitmask of orders which you seem to be (not sure I > > like that tbh - I feel like we shoudl simply provide one order but open for > > disucssion) - shouldn't it return an unsigned long? > > We are indifferent to whether a single order or a bitmask is returned, > as we only use order-0 and order-9. We have no use cases for > middle-order pages, though this feature might be useful for other > architectures or for some special use cases. Well surely we want to potentially specify a mTHP under certain circumstances no? In any case I feel it's worth making any bitfield a system word size. > > > > > > +#else > > > +static inline int > > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders) > > > +{ > > > + return orders; > > > +} > > > +#endif > > > + > > > static inline int highest_order(unsigned long orders) > > > { > > > return fls_long(orders) - 1; > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h > > > index eb1946a70cff..d81c1228a21f 100644 > > > --- a/include/linux/khugepaged.h > > > +++ b/include/linux/khugepaged.h > > > @@ -4,6 +4,8 @@ > > > > > > #include > > > > > > +#include > > > + > > > > Hm this is iffy too, There's probably a reason we didn't include this before, > > the headers can be so so fragile. Let's be cautious... > > I will check. Thanks! > > > > > > extern unsigned int khugepaged_max_ptes_none __read_mostly; > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > extern struct attribute_group khugepaged_attr_group; > > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > > > > > static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm) > > > { > > > - if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm)) > > > + /* > > > + * THP allocation policy can be dynamically modified via BPF. Even if a > > > + * task was allowed to allocate THPs, BPF can decide whether its forked > > > + * child can allocate THPs. > > > + * > > > + * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged. > > > + */ > > > + if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) && > > > + get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) > > > > Hmmm so there seems to be some kind of additional functionality you're providing > > here kinda quietly, which is to allow the exact same interface to determine > > whether we kick off khugepaged or not. > > > > Don't love that, I think we should be hugely specific about that. > > > > This bpf interface should literally be 'ok we're deciding what order we > > want'. It feels like a bit of a gross overloading? > > This makes sense. I have no objection to reverting to returning a single order. OK but key point here is - we're now determining if a forked child can _not_ allocate THPs using this function. To me this should be a separate function rather than some _weird_ usage of this same function. And generally at this point I think we should just drop this bit of code honestly. > > > > > > __khugepaged_enter(mm); > > > } > > > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > index 4108bcd96784..d10089e3f181 100644 > > > --- a/mm/Kconfig > > > +++ b/mm/Kconfig > > > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT > > > > > > EXPERIMENTAL because the impact of some changes is still unclear. > > > > > > +config EXPERIMENTAL_BPF_ORDER_SELECTION > > > + bool "BPF-based THP order selection (EXPERIMENTAL)" > > > + depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL > > > + > > > + help > > > + Enable dynamic THP order selection using BPF programs. This > > > + experimental feature allows custom BPF logic to determine optimal > > > + transparent hugepage allocation sizes at runtime. > > > + > > > + Warning: This feature is unstable and may change in future kernel > > > + versions. > > > > Thanks! This is important to document. Absolute nitty nit: can you capitalise > > 'WARNING'? Thanks! > > will do it. Thanks! > > > > > > + > > > endif # TRANSPARENT_HUGEPAGE > > > > > > # simple helper to make the code a bit easier to read > > > diff --git a/mm/Makefile b/mm/Makefile > > > index ef54aa615d9d..cb55d1509be1 100644 > > > --- a/mm/Makefile > > > +++ b/mm/Makefile > > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o > > > obj-$(CONFIG_NUMA) += memory-tiers.o > > > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o > > > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o > > > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o > > > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > > > obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o > > > obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o > > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c > > > new file mode 100644 > > > index 000000000000..fbff3b1bb988 > > > --- /dev/null > > > +++ b/mm/bpf_thp.c > > > > As mentioned before, please update MAINTAINERS for new files. I went to great + > > painful lengths to get everything listed there so let's keep it that way please > > :P > > will do it. Thanks! > > > > > > @@ -0,0 +1,186 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +struct bpf_thp_ops { > > > + /** > > > + * @get_suggested_order: Get the suggested THP orders for allocation > > > + * @mm: mm_struct associated with the THP allocation > > > + * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL) > > > + * When NULL, the decision should be based on @mm (i.e., when > > > + * triggered from an mm-scope hook rather than a VMA-specific > > > + * context). > > > + * Must belong to @mm (guaranteed by the caller). > > > + * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL) > > > + * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL) > > > + * @orders: Bitmask of requested THP orders for this allocation > > > + * - PMD-mapped allocation if PMD_ORDER is set > > > + * - mTHP allocation otherwise > > > + * > > > + * Rerurn: Bitmask of suggested THP orders for allocation. The highest > > > + * suggested order will not exceed the highest requested order > > > + * in @orders. > > > + */ > > > + int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders) __rcu; > > > > I feel like we should be declaring this function pointer type somewhere else as > > we're now duplicating this in two places. > > agreed, I have already done it to fix the spare warning. Thanks! > > > > > > +}; > > > + > > > +static struct bpf_thp_ops bpf_thp; > > > +static DEFINE_SPINLOCK(thp_ops_lock); > > > + > > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders) > > > > surely tva_flag? As this is an enum value? > > will change it to tva_type instead. Thanks! > > > > > > +{ > > > + int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders); > > > > This type for vma flags is totally incorrect. vm_flags_t. And that's going to > > change soon to an opaque type. > > > > Also right now it's actually an unsigned long. > > > > I really really do not like that we're providing extra, unexplained VMA flags > > for some reason. I may be missing something :) so happy to hear why this is > > necessary. > > > > However in future we really shouldn't be passing something like this. > > will change it as replied in another thread. Thanks! > > > > > Also - now a third duplication of the same function pointer :) can we do better > > than this? At least typedef it. > > > > > + int suggested_orders = orders; > > > + > > > + /* No BPF program is attached */ > > > + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, > > > + &transparent_hugepage_flags)) > > > + return suggested_orders; > > > > This is atomic ofc, but are we concerned about races, or I guess you expect only > > the first attached bpf program to work with it I suppose. > > It is against the race to unreg or update. OK cool, it does make sense overall. > > > > > > + > > > + rcu_read_lock(); > > > > Is this sufficient? Anything stopping the mm or VMA going away here? > > This RCU lock is not for protecting the mm or VMA structures > themselves, but for protecting the update of the function pointer. > Arbitrary access to pointers within the mm_struct or vm_area_struct is > prohibited, as they are guarded by the BPF verifier. > > > > > > + bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order); > > > + if (!bpf_suggested_order) > > > + goto out; > > > + > > > + suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders); > > > > OK so now it's suggested order_S but we're invoking suggested order :) whaaatt? > > :) > > will change it. Thanks! > > > > > > + if (highest_order(suggested_orders) > highest_order(orders)) > > > + suggested_orders = orders; > > > > Hmmm so the semantics are - whichever is the highest order wins? > > The maximum requested order is determined by the callsite. For example: > - PMD-mapped THP uses PMD_ORDER > - mTHP uses (PMD_ORDER - 1) > > We must respect this upper bound to avoid undefined behavior. So the > highest suggested order can't exceed the highest requested order. OK, please document this in a comment here. > > > > > I thought the idea was we'd hand control over to bpf if provided in effect? > > > > Definitely worth going over these semantics in the cover letter (and do forgive > > me if you have and I've missed! :) > > It has already in the cover letter: > > * Return: Bitmask of suggested THP orders for allocation. The highest > * suggested order will not exceed the highest requested order > * in @orders. OK cool thanks, a comment here would be useful also. > > > > > > > + > > > +out: > > > + rcu_read_unlock(); > > > + return suggested_orders; > > > +} > > > + > > > +static bool bpf_thp_ops_is_valid_access(int off, int size, > > > + enum bpf_access_type type, > > > + const struct bpf_prog *prog, > > > + struct bpf_insn_access_aux *info) > > > +{ > > > + return bpf_tracing_btf_ctx_access(off, size, type, prog, info); > > > +} > > > + > > > +static const struct bpf_func_proto * > > > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > > > +{ > > > + return bpf_base_func_proto(func_id, prog); > > > +} > > > + > > > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = { > > > + .get_func_proto = bpf_thp_get_func_proto, > > > + .is_valid_access = bpf_thp_ops_is_valid_access, > > > +}; > > > + > > > +static int bpf_thp_init(struct btf *btf) > > > +{ > > > + return 0; > > > +} > > > + > > > +static int bpf_thp_init_member(const struct btf_type *t, > > > + const struct btf_member *member, > > > + void *kdata, const void *udata) > > > +{ > > > + return 0; > > > +} > > > + > > > +static int bpf_thp_reg(void *kdata, struct bpf_link *link) > > > +{ > > > + struct bpf_thp_ops *ops = kdata; > > > + > > > + spin_lock(&thp_ops_lock); > > > + if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, > > > + &transparent_hugepage_flags)) { > > > + spin_unlock(&thp_ops_lock); > > > + return -EBUSY; > > > + } > > > + WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order)); > > > + rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order); > > > + spin_unlock(&thp_ops_lock); > > > + return 0; > > > +} > > > + > > > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link) > > > +{ > > > + spin_lock(&thp_ops_lock); > > > + clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags); > > > + WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order)); > > > + rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock)); > > > + spin_unlock(&thp_ops_lock); > > > + > > > + synchronize_rcu(); > > > +} > > > > I am a total beginner with BPF implementations so don't feel like I can say much > > intelligent about the above. But presumably fairly standard fare BPF-wise? > > This implementation is necessary to support BPF program updates. Ack. > > > > > Will perhaps try to dig deeper on another iteration :) as intersting to me. > > > > > + > > > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link) > > > +{ > > > + struct bpf_thp_ops *ops = kdata; > > > + struct bpf_thp_ops *old = old_kdata; > > > + int ret = 0; > > > + > > > + if (!ops || !old) > > > + return -EINVAL; > > > + > > > + spin_lock(&thp_ops_lock); > > > + /* The prog has aleady been removed. */ > > > + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) { > > > + ret = -ENOENT; > > > + goto out; > > > + } > > > > OK so we gate things on this flag and it's global, got it. > > > > I see this is a hook, and I guess RCU-all-the-things is what BPF does which > > makes tonnes of sense. > > > > > + WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order)); > > > + rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order, > > > + lockdep_is_held(&thp_ops_lock)); > > > + > > > +out: > > > + spin_unlock(&thp_ops_lock); > > > + if (!ret) > > > + synchronize_rcu(); > > > + return ret; > > > +} > > > + > > > +static int bpf_thp_validate(void *kdata) > > > +{ > > > + struct bpf_thp_ops *ops = kdata; > > > + > > > + if (!ops->get_suggested_order) { > > > + pr_err("bpf_thp: required ops isn't implemented\n"); > > > + return -EINVAL; > > > + } > > > + return 0; > > > +} > > > + > > > +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type vm_flags, int orders) > > > +{ > > > + return orders; > > > +} > > > + > > > +static struct bpf_thp_ops __bpf_thp_ops = { > > > + .get_suggested_order = suggested_order, > > > +}; > > > > Can you explain to me what this stub stuff is for? This is more 'BPF impl 101' > > stuff sorry :) > > It is a CFI stub. cfi_stubs in BPF struct_ops are secure intermediary > functions that prevent the kernel from making direct, unsafe jumps to > BPF code. A new attached BPF program will run via this stub. Ack. > > > > > > + > > > +static struct bpf_struct_ops bpf_bpf_thp_ops = { > > > + .verifier_ops = &thp_bpf_verifier_ops, > > > + .init = bpf_thp_init, > > > + .init_member = bpf_thp_init_member, > > > + .reg = bpf_thp_reg, > > > + .unreg = bpf_thp_unreg, > > > + .update = bpf_thp_update, > > > + .validate = bpf_thp_validate, > > > + .cfi_stubs = &__bpf_thp_ops, > > > + .owner = THIS_MODULE, > > > + .name = "bpf_thp_ops", > > > +}; > > > + > > > +static int __init bpf_thp_ops_init(void) > > > +{ > > > + int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops); > > > + > > > + if (err) > > > + pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err); > > > + return err; > > > +} > > > +late_initcall(bpf_thp_ops_init); > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > index d89992b65acc..bd8f8f34ab3c 100644 > > > --- a/mm/huge_memory.c > > > +++ b/mm/huge_memory.c > > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > > return ret; > > > khugepaged_enter_vma(vma, vma->vm_flags); > > > > > > + /* > > > + * This check must occur after khugepaged_enter_vma() because: > > > + * 1. We may permit THP allocation via khugepaged > > > + * 2. While simultaneously disallowing THP allocation > > > + * during page fault handling > > > + */ > > > + if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) != > > > + BIT(PMD_ORDER)) > > > > Hmmm so you return a bitmask of orders, but then you only allow this fault if > > the only order provided is PMD order? That seems strange. Can you explain? > > This is in the do_huge_pmd_anonymous_page() that can only accept a PMD > order, otherwise it might result in unexpected behavior. OK please document this in the comment. > > > > > > + return VM_FAULT_FALLBACK; > > > > It'd be good to have a helper function for this like: > > > > if (!bpf_hook_allow_pmd_order(vma, tva_flag)) > > return VM_FAULT_FALLBACK; > > > > And implemented like maybe: > > > > static bool bpf_hook_allow_pmd_order(struct vm_area_struct *vma, enum tva_type tva_flag) > > { > > int orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, tva_flag, > > BIT(PMD_ORDER)); > > > > return orders & BIT(PMD_ORDER); > > } > > > > It's good the tva flag gives context though. > > Thanks for the suggestion. > will change it. Thanks! > > > > > > + > > > if (!(vmf->flags & FAULT_FLAG_WRITE) && > > > !mm_forbids_zeropage(vma->vm_mm) && > > > transparent_hugepage_use_zero_page()) { > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > > index d3d4f116e14b..935583626db6 100644 > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, > > > { > > > if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && > > > hugepage_pmd_enabled()) { > > > - if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) > > > + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) && > > > + get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED, > > > + BIT(PMD_ORDER))) > > > > I don't know why we aren't working the bpf hook into thp_vma_allowable_order()? > > Actually it can be added into thp_vma_allowable_order(). I will change it. Thanks! > > > > > Also a helper would work here. > > > > > __khugepaged_enter(vma->vm_mm); > > > } > > > } > > > @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, > > > return SCAN_ADDRESS_RANGE; > > > if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER)) > > > return SCAN_VMA_CHECK; > > > + if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER))) > > > + return SCAN_VMA_CHECK; > > > > > > > > > /* > > > * Anon VMA expected, the address may be unmapped then > > > * remapped to file after khugepaged reaquired the mmap_lock. > > > @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > > > /* khugepaged_mm_lock actually not necessary for the below */ > > > mm_slot_free(mm_slot_cache, mm_slot); > > > mmdrop(mm); > > > + } else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) { > > > + hash_del(&slot->hash); > > > + list_del(&slot->mm_node); > > > + mm_flags_clear(MMF_VM_HUGEPAGE, mm); > > > + mm_slot_free(mm_slot_cache, mm_slot); > > > } > > > } > > > > > > @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > > if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER)) > > > return SCAN_VMA_CHECK; > > > > > > + if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE, > > > + BIT(PMD_ORDER))) > > > > Again, can we please not duplicate thp_vma_allowable_order() logic? > > > > The THP code is horrible enough, but now we have to remember to also do the bpf > > check? > > makes sense. > > > > > > + return SCAN_VMA_CHECK; > > > /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ > > > if (userfaultfd_wp(vma)) > > > return SCAN_PTE_UFFD_WP; > > > @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, > > > * the next mm on the list. > > > */ > > > vma = NULL; > > > + > > > + /* If this mm is not suitable for the scan list, we should remove it. */ > > > + if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) > > > + goto breakouterloop_mmap_lock; > > > > OK again I'm really not loving this NULL, 0, -1 stuff. What is this supposed to > > mean? The idea here is we have a hook for 'trying to determine THP order' and > > now it's overloaded it seems in multiple ways? > > > > I may be missing context here. > > > > I'm also a bit perplexed by the comment as to what is intended here. > > Using a BPF-based approach for THP adjustment allows us to dynamically > enable or disable THP for running applications without causing any > disruption. This capability is particularly valuable in production > environments. The logic here is designed to achieve exactly that. > > > > > > > if (unlikely(!mmap_read_trylock(mm))) > > > goto breakouterloop_mmap_lock; > > > > > > @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, > > > progress++; > > > break; > > > } > > > - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) { > > > + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) || > > > + !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED, > > > + BIT(PMD_ORDER))) { > > > > Same various comments from above. > > will change it. > > > > > > skip: > > > progress++; > > > continue; > > > @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, > > > if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER)) > > > return -EINVAL; > > > > > > + if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE, > > > + BIT(PMD_ORDER))) > > > + return -EINVAL; > > > + > > > > Same various comments from above. > > will change it. > > > > > > cc = kmalloc(sizeof(*cc), GFP_KERNEL); > > > if (!cc) > > > return -ENOMEM; > > > diff --git a/mm/memory.c b/mm/memory.c > > > index d9de6c056179..0178857aa058 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > > > static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > { > > > struct vm_area_struct *vma = vmf->vma; > > > + int order, suggested_orders; > > > unsigned long orders; > > > struct folio *folio; > > > unsigned long addr; > > > @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > spinlock_t *ptl; > > > pte_t *pte; > > > gfp_t gfp; > > > - int order; > > > > > > /* > > > * If uffd is active for the vma we need per-page fault fidelity to > > > @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > if (!zswap_never_enabled()) > > > goto fallback; > > > > > > + suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, > > > + TVA_PAGEFAULT, > > > + BIT(PMD_ORDER) - 1); > > > + if (!suggested_orders) > > > + goto fallback; > > (Thanks for all above! :) > > Wait, but below we have a bunch of fallbacks, now BPF overrides everything? > > When allocating high-order pages is not feasible, such as during > periods of high memory pressure, the system should immediately fall > back to using 4 KB pages. OK makes sense. > > > > > I know I'm repaeting myself :P but can we just please put this into > > thp_vma_allowable_orders(), it's massively gross to just duplicate this check > > _everywhere_ with subtle differences. > > will change it. Thanks > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > /* > > > * Get a list of all the (large) orders below PMD_ORDER that are enabled > > > * and suitable for swapping THP. > > > */ > > > orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, > > > - BIT(PMD_ORDER) - 1); > > > + suggested_orders); > > > orders = thp_vma_suitable_orders(vma, vmf->address, orders); > > > orders = thp_swap_suitable_orders(swp_offset(entry), > > > vmf->address, orders); > > > @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > > > { > > > struct vm_area_struct *vma = vmf->vma; > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > + int order, suggested_orders; > > > unsigned long orders; > > > struct folio *folio; > > > unsigned long addr; > > > pte_t *pte; > > > gfp_t gfp; > > > - int order; > > > > > > /* > > > * If uffd is active for the vma we need per-page fault fidelity to > > > @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > > > if (unlikely(userfaultfd_armed(vma))) > > > goto fallback; > > > > > > + suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, > > > + TVA_PAGEFAULT, > > > + BIT(PMD_ORDER) - 1); > > > + if (!suggested_orders) > > > + goto fallback; > > > > Same comment as above. > > will change it. Thanks! > > > Thanks a lot for your comments. No problem, thanks for the series! I am generally excited about exploring this, so once we figure out details be good to see where this can go! > > > -- > Regards > > Yafang Cheers, Lorenzo