Message-ID: <601e015b-1f61-45e8-9db8-4e0d2bc1505e@redhat.com>
Date: Thu, 24 Jul 2025 21:07:49 +0200
X-Mailing-List: linux-doc@vger.kernel.org
Subject: Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
To: Usama Arif, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
 Jonathan Corbet, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
 "Liam R. Howlett", Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox, Johannes Weiner
References: <20250721090942.274650-1-david@redhat.com> <3ec01250-0ff3-4d04-9009-7b85b6058e41@gmail.com>
From: David Hildenbrand
Organization: Red Hat
In-Reply-To: <3ec01250-0ff3-4d04-9009-7b85b6058e41@gmail.com>

On 24.07.25 20:57, Usama Arif wrote:
>
>
> On 21/07/2025 10:09, David Hildenbrand wrote:
>> People want to make use of more THPs, for example, moving from
>> THP=never to THP=madvise, or from THP=madvise to THP=always.
>>
>> While this is great news for every THP desperately waiting to get
>> allocated out there, apparently there are some workloads that require a
>> bit of care during that transition: once problems are detected, these
>> workloads should be started with the old behavior, without making all
>> other workloads on the system go back to the old behavior as well.
>>
>> In essence, the following scenarios are imaginable:
>>
>> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>>     behavior (no THP) for selected workloads.
>>
>> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>>     selected workloads.
>>
>> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>>     (THP only when advised) for selected workloads.
>>
>> (4) Stay at THP=madvise, but have "always" behavior for selected
>>     workloads.
>>
>> In essence, (2) can be emulated through (1), by setting THP!=none while
>> disabling THPs for all processes that don't want THPs. It requires
>> configuring all workloads, but that is a user-space problem to sort out.
>>
>> (4) can be emulated through (3) in a similar way.
>>
>> Back when (1) was relevant in the past, as people started enabling THPs,
>> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
>> yet (e.g., Redis) were able to just disable THPs completely. Redis
>> still implements the option to use this interface to disable THPs
>> completely.
>>
>> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
>> workload -- a process, including fork+exec'ed process hierarchy.
>> That essentially made us support (1): simply disable THPs for all workloads
>> that are not ready for THPs yet, while still enabling THPs system-wide.
>>
>> The quest for handling (3) and (4) started, but current approaches
>> (completely new prctl, options to set other policies per process,
>> alternatives to prctl -- mctrl, cgroup handling) don't look particularly
>> promising.
>> Likely, the future will use bpf or something similar to
>> implement better policies, in particular to also make better decisions
>> about THP sizes to use, but this will certainly take a while as that work
>> just started.
>>
>> Long story short: a simple enable/disable is not really suitable for the
>> future, so we're not willing to add completely new toggles.
>>
>> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
>> completely for these processes, this scares many THPs in our system
>> because they could no longer get allocated where they used to be
>> allocated: regions flagged as VM_HUGEPAGE. Apparently, that poses a
>> problem for relevant workloads, because "no THPs" is certainly worse
>> than "THPs only when advised".
>>
>> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless
>> explicitly advised by the app through MADV_HUGEPAGE"? *maybe*, but this
>> would change the documented semantics quite a bit, and the versatility
>> to use it for debugging purposes, so I am not 100% sure that is what we
>> want -- although it would certainly be much easier.
>>
>> So instead, as an easy way forward for (3) and (4), add an option to
>> make PR_SET_THP_DISABLE disable *fewer* THPs for a process.
>>
>> In essence, this patch:
>>
>> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>>
>>     Until now, arg3 was not allowed to be set (-EINVAL). Now it holds
>>     flags.
>>
>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>     Until now, it would return 1 if THPs were disabled completely. Now
>>     it essentially returns the set flags as well.
>>
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>     the semantics clearly.
>>
>>     Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>     Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
>>     disabled completely
>>
>>     Only indicating that THPs are disabled when they are really disabled
>>     completely, not only partially.
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
>>
>> There is currently no way to prevent a process from issuing
>> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
>> to PR_SET_THP_DISABLE through another flag if ever required. The known
>> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
>> that is not added for now.
>>
>> Cc: Jonathan Corbet
>> Cc: Andrew Morton
>> Cc: Lorenzo Stoakes
>> Cc: Zi Yan
>> Cc: Baolin Wang
>> Cc: "Liam R. Howlett"
>> Cc: Nico Pache
>> Cc: Ryan Roberts
>> Cc: Dev Jain
>> Cc: Barry Song
>> Cc: Vlastimil Babka
>> Cc: Mike Rapoport
>> Cc: Suren Baghdasaryan
>> Cc: Michal Hocko
>> Cc: Usama Arif
>> Cc: SeongJae Park
>> Cc: Jann Horn
>> Cc: Liam R. Howlett
>> Cc: Yafang Shao
>> Cc: Matthew Wilcox
>> Signed-off-by: David Hildenbrand
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
>> also block MADV_COLLAPSE, which can be very helpful.
>> Of course, I thought
>> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
>>
>> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions".
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to, or storing this thp information elsewhere.
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree, like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
>>
>> This is *completely* untested and might be utterly broken. It merely
>> serves as a PoC of what I think could be done. If this ever goes upstream,
>> we need some kselftests for it, and extensive tests.
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> ---
>>  Documentation/filesystems/proc.rst |  5 +--
>>  fs/proc/array.c                    |  2 +-
>>  include/linux/huge_mm.h            | 20 ++++++++---
>>  include/linux/mm_types.h           | 13 +++----
>>  include/uapi/linux/prctl.h         |  7 ++++
>>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>  mm/khugepaged.c                    |  2 +-
>>  7 files changed, 78 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
>> index 2971551b72353..915a3e44bc120 100644
>> --- a/Documentation/filesystems/proc.rst
>> +++ b/Documentation/filesystems/proc.rst
>> @@ -291,8 +291,9 @@ It's slow but very precise.
>>   HugetlbPages                size of hugetlb memory portions
>>   CoreDumping                 process's memory is currently being dumped
>>                               (killing the process may lead to a corrupted core)
>> - THP_enabled                 process is allowed to use THP (returns 0 when
>> -                             PR_SET_THP_DISABLE is set on the process
>> + THP_enabled                 process is allowed to use THP (returns 0 when
>> +                             PR_SET_THP_DISABLE is set on the process to disable
>> +                             THP completely, not just partially)
>>   Threads                     number of threads
>>   SigQ                        number of signals queued/max. number for queue
>>   SigPnd                      bitmap of pending signals for the thread
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index d6a0369caa931..c4f91a784104f 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>>
>>  	if (thp_enabled)
>> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
>> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>  }
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e0a27f80f390d..c4127104d9bc3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -323,16 +323,26 @@ struct thpsize {
>>  	(transparent_hugepage_flags & \
>>  	 (1<>
>> +/*
>> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
>> + * architectures may disable THP for some mappings, for example, s390 kvm.
>> + */
>>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>  		vm_flags_t vm_flags)
>>  {
>> +	/* Are THPs disabled for this VMA? */
>> +	if (vm_flags & VM_NOHUGEPAGE)
>> +		return true;
>> +	/* Are THPs disabled for all VMAs in the whole process? */
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
>> +		return true;
>>  	/*
>> -	 * Explicitly disabled through madvise or prctl, or some
>> -	 * architectures may disable THP for some mappings, for
>> -	 * example, s390 kvm.
>> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
>> +	 * advise to use them?
>>  	 */
>> -	return (vm_flags & VM_NOHUGEPAGE) ||
>> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		return false;
>> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>  }
>
>
> Hi David,

Hi!

>
> Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
> VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
> and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
> as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?

THPs are disabled for these regions, so it's at least consistent with
the "disable all", but ...

>
> This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
> hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
> which also ends up calling hugepage_vma_revalidate.
>
> A hacky way would be to save and overwrite vma->vm_flags with VM_HUGEPAGE at the start of madvise_collapse
> if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
> (Not something I am recommending, just throwing it out there).

Gah.

>
> Another possibility is to pass the fact that you are in madvise_collapse to these functions
> as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
> already has collapse control arg, so just need to take care of thp_vma_allowable_orders.

Likely this.

>
> Any preference or better suggestions?

What you are asking for is not MMF_DISABLE_THP_EXCEPT_ADVISED as I
planned it, but MMF_DISABLE_THP_EXCEPT_ADVISED_OR_MADV_COLLAPSE.

Now, one could consider MADV_COLLAPSE an "advise". (I am not opposed to
that change)

Indeed, the right way might be telling vma_thp_disabled() whether we are
in collapse.

Can you try implementing that on top of my patch to see how it looks?

-- 
Cheers,

David / dhildenb
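
For illustration, the idea of telling vma_thp_disabled() whether we are
in collapse could be modeled in user space roughly as below. This is only
a sketch under stated assumptions, not kernel code and not part of the
patch: the bit values, the in_collapse parameter and the name
vma_thp_disabled_model() are made up for this example, and the real
function takes a struct vm_area_struct and tests bits in mm->flags.

```c
#include <stdbool.h>

/* Hypothetical stand-in bit values; the kernel's real values differ. */
#define VM_HUGEPAGE			(1ul << 0)
#define VM_NOHUGEPAGE			(1ul << 1)
#define MMF_DISABLE_THP_COMPLETELY	(1ul << 0)
#define MMF_DISABLE_THP_EXCEPT_ADVISED	(1ul << 1)

/*
 * User-space model of the check. "in_collapse" is an assumed extra
 * argument meaning "called on behalf of MADV_COLLAPSE".
 */
static bool vma_thp_disabled_model(unsigned long vm_flags,
				   unsigned long mm_flags,
				   bool in_collapse)
{
	/* Are THPs disabled for this VMA? */
	if (vm_flags & VM_NOHUGEPAGE)
		return true;
	/* Are THPs disabled for all VMAs in the whole process? */
	if (mm_flags & MMF_DISABLE_THP_COMPLETELY)
		return true;
	/* Assumption: MADV_COLLAPSE counts as advice, like VM_HUGEPAGE. */
	if ((vm_flags & VM_HUGEPAGE) || in_collapse)
		return false;
	/* Otherwise, disabled only if "except advised" mode is active. */
	return (mm_flags & MMF_DISABLE_THP_EXCEPT_ADVISED) != 0;
}
```

In this model, a region without VM_HUGEPAGE in a process that has
MMF_DISABLE_THP_EXCEPT_ADVISED set is reported as disabled when
in_collapse is false, but not when in_collapse is true, while
VM_NOHUGEPAGE and MMF_DISABLE_THP_COMPLETELY still block both cases.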