Message-ID: <2fd7f80c-2b13-4478-900a-d65547586db3@gmail.com>
Date: Tue, 10 Jun 2025 16:30:43 +0100
Subject: Re: [DISCUSSION] proposed mctl() API
To: Lorenzo Stoakes, David Hildenbrand
Cc: Andrew Morton, Shakeel Butt, "Liam R. Howlett", Vlastimil Babka,
 Jann Horn, Arnd Bergmann, Christian Brauner, SeongJae Park,
 Mike Rapoport, Johannes Weiner, Barry Song <21cnbao@gmail.com>,
 linux-mm@kvack.org, linux-arch@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
 Pedro Falcato, Matthew Wilcox
References: <85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local>
From: Usama Arif

On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
>>
>>
>> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
>>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
>>>>
>>>>
>>>> My knowledge of security is limited, so please bear with me, but I
>>>> actually didn't understand the security issue and the need for
>>>> CAP_SYS_ADMIN for doing VM_(NO)HUGEPAGE.
>>>>
>>>> A process can already madvise its own VMAs, and this is just doing
>>>> that for the entire process. And VM_INIT_DEF_MASK is already set to
>>>> VM_NOHUGEPAGE, so it will be inherited from the parent. Just adding
>>>> VM_HUGEPAGE shouldn't be an issue? Inheriting MMF_VM_HUGEPAGE will
>>>> mean that khugepaged would enter for that process as well, which
>>>> again doesn't seem like a security issue to me.
>>>
>>> W.R.T. the current process, the issue is one Jann raised, in relation
>>> to propagation of behaviour to privileged (e.g. setuid) processes.
>>>
>>
>> But what is the actual security issue of having hugepages (or not
>> having them) when the process is running with setuid?
>
> Speak to Jann about this. Security isn't my area. He gave feedback on
> this, which is why I raised it, if you search through previous threads
> you can find it.
>

Yes, he is in CC here as well. I have read it in the previous thread.
Just raising it here as it was mentioned here :)

>>
>> I know the cgroup proposal has been shot down, but let's imagine this
>> was a cgroup setting, similar to the other memory controls we have,
>> e.g. memory.swap.{max,high,peak}.
>>
>> We can chown the cgroup so that the property is set by an unprivileged
>> process.
>>
>> Having the process swap with setuid when the unprivileged process has
>> swap disabled in the cgroup is not the right behaviour. What currently
>> happens is that the process, after obtaining the higher privilege
>> level, doesn't swap either.
>>
>> Similarly for hugepages, if it was a cgroup-level setting, having the
>> process always get hugepages with setuid when the unprivileged user
>> had disabled them, or vice versa, would not be the right behaviour.
>>
>> Another example is PR_SET_MEMORY_MERGE: setuid does not change how it
>> works as far as I can tell.
>>
>> So I don't see what the security issue is and why we would need to
>> elevate privileges to do this.
>>
>>> W.R.T. remote processes, obviously we want to make sure we are
>>> permitted to do so.
>>>
>>
>> I know that this needs to be future proof. But I don't actually know
>> of a real-world use case where we want to do any of these things for
>> remote processes, whether it's the existing per-process changes like
>> PR_SET_MEMORY_MERGE for KSM and PR_SET_THP_DISABLE for THP, or the
>> newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE or Barry's proposal.
>> All of them are for the process itself (and its children by
>> fork+exec) and not for remote processes. As we try to make our changes
>> use-case driven, I think we should not add support for remote
>> processes (which is another reason why I think this might sit better
>> in prctl).
>
> I'm extremely confused as to why you think this proposal is predicated
> upon remote process manipulation? It was simply suggested as a
> possibility for increased flexibility.
>
> We can just remove this parameter no?
>

Sure.
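For reference, this is roughly what "for the process itself" looks like
with one of the existing controls named above. It is only a minimal
sketch: the helper names thp_disable_self() and thp_disabled_self() are
made up for illustration, and the fallback defines assume the values
from linux/prctl.h. Note that the call takes no pid argument, so it can
only ever act on the caller.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values as in linux/prctl.h. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Hypothetical helper: set (non-zero) or clear (zero) the "THP disable"
 * flag on the calling process. There is no pid parameter, so a remote
 * process cannot be targeted this way.
 * Returns 0 on success, -1 with errno set on failure.
 */
int thp_disable_self(int disable)
{
	return prctl(PR_SET_THP_DISABLE, disable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: query the flag for the calling process.
 * Returns 1 if THP is disabled, 0 if not, -1 with errno set on failure.
 */
int thp_disabled_self(void)
{
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

A launcher would call thp_disable_self(1) before fork+exec of the
workload; per prctl(2) the setting is inherited across fork() and
preserved across execve().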
> It is entirely orthogonal to the prctl() stuff.
>
> Overall at this point I share Matthew's point of view on this - we
> shouldn't be doing any of this upstream.

As I replied to Matthew in [1], it would be amazing if it was not needed,
but that's not how it works in the medium term and I don't think it will
work even in the long term. I will paste my answer from [1] below as
well:

If we have 2 workloads on the same server, e.g. one is a database where
THPs just don't do well, but the other one is AI where THPs do really
well, how will the kernel detect that the database workload is
performing worse and the AI one isn't?

I added the THP shrinker to hopefully try and do this automatically, and
it does really help. But unfortunately it is not a complete solution.
There are severely memory-bound workloads where even a tiny increase in
memory will lead to an OOM. And if you colocate the container that's
running that workload with one that benefits from THPs, we unfortunately
can't just rely on the system doing the right thing.

It would be awesome if THPs were truly transparent and didn't require
any input, but unfortunately I don't think that there is a solution for
this with just kernel monitoring.

This is just a big hint from the user. If the global system policy is
madvise and the workload owner has done their own benchmarks and sees
benefits with always, they set DEFAULT_MADV_HUGEPAGE for the process to
opt in as "always". If the global system policy is always and the
workload owner has done their own benchmarks and sees worse results with
always, they set DEFAULT_MADV_NOHUGEPAGE for the process to opt in as
"madvise".

[1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/

I haven't seen activity on this thread over the past week, but I was
hoping we can reach a consensus on which approach to use, prctl or mctl.
If it's mctl, and you don't think this should be done, please let me
know if you would like me to work on it instead.
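To spell out the DEFAULT_MADV_(NO)HUGEPAGE hint semantics described
above, here is a small userspace model (not kernel code). The enum and
function names are invented for illustration. Only the two opt-in cases
come from the description above; leaving every other combination,
including a global "never", unchanged is an assumption.

```c
/* Global policy, as in /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_mode {
	THP_NEVER,
	THP_MADVISE,
	THP_ALWAYS,
};

/* Per-process hint; hypothetical names mirroring the proposal. */
enum thp_proc_hint {
	HINT_NONE,
	HINT_DEFAULT_MADV_HUGEPAGE,
	HINT_DEFAULT_MADV_NOHUGEPAGE,
};

/*
 * Model of how a process would behave given the global policy and its
 * own hint:
 *   madvise + DEFAULT_MADV_HUGEPAGE   -> behaves as "always"
 *   always  + DEFAULT_MADV_NOHUGEPAGE -> behaves as "madvise"
 * Anything else keeps the global policy (assumption).
 */
enum thp_mode effective_thp_mode(enum thp_mode global,
				 enum thp_proc_hint hint)
{
	if (global == THP_MADVISE && hint == HINT_DEFAULT_MADV_HUGEPAGE)
		return THP_ALWAYS;
	if (global == THP_ALWAYS && hint == HINT_DEFAULT_MADV_NOHUGEPAGE)
		return THP_MADVISE;
	return global;
}
```

With this model the database and AI workloads from the example can sit
on the same host under one global policy and still end up with
different effective behaviour.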
This is a valid, significant real-world use case, and it is a real
blocker for deploying THPs for workloads on servers.

Thanks!
Usama