Message-ID: <6d8832bb-b5a7-4cd9-b92c-c93f2c1fe182@gmail.com>
Date: Wed, 2 Jul 2025 15:15:01 +0100
From: Usama Arif
Subject: Re: [DISCUSSION] proposed mctl() API
To: David Hildenbrand, linux-mm@kvack.org, Andrew Morton
Cc: Shakeel Butt, "Liam R . Howlett", Vlastimil Babka, Jann Horn,
 Arnd Bergmann, Christian Brauner, SeongJae Park, Mike Rapoport,
 Johannes Weiner, Barry Song <21cnbao@gmail.com>,
 linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-api@vger.kernel.org, Pedro Falcato, Matthew Wilcox,
 Lorenzo Stoakes
References: <85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local>
 <2fd7f80c-2b13-4478-900a-d65547586db3@gmail.com>
In-Reply-To: <2fd7f80c-2b13-4478-900a-d65547586db3@gmail.com>

> As I replied to Matthew in [1], it would be amazing if it was not needed, but that's not
> how it works in the medium term and I don't think it will work even in the long term.
> I will paste my answer from [1] below as well:
>
> If we have 2 workloads on the same server, e.g. one is a database where THPs
> just don't do well, but the other one is AI where THPs do really well. How
> will the kernel monitor that the database workload is performing worse
> and the AI one isn't?
>
> I added the THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory-bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container that's running
> that workload with one which would benefit from THPs, we unfortunately
> can't just rely on the system doing the right thing.
>
> It would be awesome if THPs are truly transparent and don't require
> any input, but unfortunately I don't think that there is a solution
> for this with just kernel monitoring.
>
> This is just a big hint from the user.
> If the global system policy is madvise
> and the workload owner has done their own benchmarks and sees benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to opt in as "always".
> If the global system policy is always and the workload owner has done their own
> benchmarks and sees worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for
> the process to opt in as "madvise".
>
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/
>
>
> I haven't seen activity on this thread over the past week, but I was hoping
> we can reach a consensus on which approach to use, prctl or mctl.
> If it's mctl and if you don't think this should be done, please let me know
> if you would like me to work on this instead. This is a valid, big real-world
> use case that is a real blocker for deploying THPs in workloads in servers.
>

Hi! Just wanted to check if anyone has any thoughts on this?

I think we are all in agreement on the long-term goal: have THP just
work and be enabled by default. From our perspective, we (Meta) have
spent a significant amount of time and effort over the last 18 months
making changes and optimizations to get there, so that we can
transparently and reliably get hugepages. This includes the THP
shrinker [1], changes to the allocator and reclaim/compaction code to
reduce fragmentation [2], and reducing type mixing [3]. We want to
continue spending more time and resources on trying to achieve this.

But in the current state, not being able to selectively enable THP
"always" for certain workloads is a significant blocker to rolling it
out at hyperscaler levels, and from the attempts made by others, I do
believe it's a problem others are facing as well.

Our long-term aim is to have transparent_hugepage/enabled set to always
across the fleet.
But for that we need the ability to enable it selectively for workloads
that show benefit, try and solve the problems that come up in production
when it is enabled, and understand why the other workloads regress. This
cannot be done with just transparent_hugepage/enabled for hyperscalers,
which run vastly different types of containerized workloads on the same
machine.

There have been multiple efforts from different people to address
similar problems (including via cgroup [4] and bpf [5]). IMHO, it's
quite clear that, unfortunately, a system-wide setting for THP alone is
not enough at the moment or in the near future, especially when running
workloads with completely different characteristics on the same server.

In terms of the approach, IMHO, I don't think the way to do this is
controversial. After the great feedback from Lorenzo on the prctl
series, the approach would be for userspace to make a call that just
does a for_each_vma over the process, madvises the VMAs, and changes
mm->def_flags for the process. We are not making changes to the
page-fault path (thp_vma_allowable_orders has no code change, which is
awesome).

In terms of what the call is going to be, there has been a lot of debate
(and my preference for prctl is clear), but I am OK with either prctl or
mctl, as it should not change the actual implementation.

If there is consensus, I would love to send an RFC showing what the
prctl or mctl solution would look like.

[1] https://lore.kernel.org/all/20240830100438.3623486-1-usamaarif642@gmail.com/
[2] https://lore.kernel.org/all/20250313210647.1314586-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
[4] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
[5] https://lore.kernel.org/all/20250520060504.20251-1-laoar.shao@gmail.com/

Thanks,
Usama
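
For illustration only, here is a rough toy model of the "for_each_vma,
madvise the VMAs, change mm->def_flags" idea described above. It is not
kernel code and is not taken from any posted patch: the toy_* structs,
the TOY_VM_* bits (stand-ins for the kernel's VM_HUGEPAGE and
VM_NOHUGEPAGE, with placeholder values) and toy_set_thp_default() are
all hypothetical names, and treating the opt-out direction as the exact
mirror of opt-in is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bits standing in for the kernel's VM_HUGEPAGE and
 * VM_NOHUGEPAGE; the values here are illustrative only. */
#define TOY_VM_HUGEPAGE   0x1UL
#define TOY_VM_NOHUGEPAGE 0x2UL

/* Toy stand-in for a VMA (hypothetical layout). */
struct toy_vma {
    unsigned long vm_flags;
};

/* Toy stand-in for the process mm (hypothetical layout). */
struct toy_mm {
    unsigned long def_flags;   /* flags picked up by future mappings */
    struct toy_vma *vmas;      /* existing mappings */
    size_t nr_vmas;
};

/* Set the bits in "set" and drop the bits in "clear" on a flags word. */
unsigned long toy_apply(unsigned long flags,
                        unsigned long set, unsigned long clear)
{
    return (flags & ~clear) | set;
}

/*
 * Hypothetical process-wide default. huge != 0 behaves like applying
 * MADV_HUGEPAGE to every mapping; huge == 0 is assumed to behave like
 * applying MADV_NOHUGEPAGE to every mapping.
 */
void toy_set_thp_default(struct toy_mm *mm, int huge)
{
    unsigned long set = huge ? TOY_VM_HUGEPAGE : TOY_VM_NOHUGEPAGE;
    unsigned long clear = huge ? TOY_VM_NOHUGEPAGE : TOY_VM_HUGEPAGE;
    size_t i;

    assert(mm != NULL);

    /* Existing mappings: the for_each_vma + madvise step. */
    for (i = 0; i < mm->nr_vmas; i++)
        mm->vmas[i].vm_flags = toy_apply(mm->vmas[i].vm_flags, set, clear);

    /* Future mappings: the mm->def_flags step. */
    mm->def_flags = toy_apply(mm->def_flags, set, clear);
}
```

Nothing in this model touches a fault path: it only rewrites per-VMA
flags and the default for new mappings, so whatever consumes those flags
at fault time can stay unchanged.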