Message-ID: <601e015b-1f61-45e8-9db8-4e0d2bc1505e@redhat.com>
Date: Thu, 24 Jul 2025 21:07:49 +0200
X-Mailing-List: linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
From: David Hildenbrand
Organization: Red Hat
To: Usama Arif, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
 Jonathan Corbet, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
 "Liam R. Howlett", Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox, Johannes Weiner
References: <20250721090942.274650-1-david@redhat.com>
 <3ec01250-0ff3-4d04-9009-7b85b6058e41@gmail.com>
In-Reply-To: <3ec01250-0ff3-4d04-9009-7b85b6058e41@gmail.com>

On 24.07.25 20:57, Usama Arif wrote:
>
>
> On 21/07/2025 10:09, David Hildenbrand wrote:
>> People want to make use of more THPs, for example, moving from
>> THP=never to THP=madvise, or from THP=madvise to THP=always.
>>
>> While this is great news for every THP desperately waiting to get
>> allocated out there, apparently there are some workloads that require a
>> bit of care during that transition: once problems are detected, these
>> workloads should be started with the old behavior, without making all
>> other workloads on the system go back to the old behavior as well.
>>
>> In essence, the following scenarios are imaginable:
>>
>> (1) Switch from THP=never to THP=madvise or THP=always, but keep the old
>>     behavior (no THP) for selected workloads.
>>
>> (2) Stay at THP=never, but have "madvise" or "always" behavior for
>>     selected workloads.
>>
>> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>>     (THP only when advised) for selected workloads.
>>
>> (4) Stay at THP=madvise, but have "always" behavior for selected
>>     workloads.
>>
>> In essence, (2) can be emulated through (1), by setting THP!=never while
>> disabling THPs for all processes that don't want THPs. It requires
>> configuring all workloads, but that is a user-space problem to sort out.
>>
>> (4) can be emulated through (3) in a similar way.
>>
>> Back when (1) was relevant in the past, as people started enabling THPs,
>> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
>> yet (e.g., Redis) were able to just disable THPs completely. Redis
>> still implements the option to use this interface to disable THPs
>> completely.
>>
>> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
>> workload -- a process, including fork+exec'ed process hierarchy.
>> That essentially made us support (1): simply disable THPs for all workloads
>> that are not ready for THPs yet, while still enabling THPs system-wide.
>>
>> The quest for handling (3) and (4) started, but current approaches
>> (completely new prctl, options to set other policies per process,
>> alternatives to prctl -- mctrl, cgroup handling) don't look particularly
>> promising.
>> Likely, the future will use bpf or something similar to
>> implement better policies, in particular to also make better decisions
>> about THP sizes to use, but this will certainly take a while as that work
>> just started.
>>
>> Long story short: a simple enable/disable is not really suitable for the
>> future, so we're not willing to add completely new toggles.
>>
>> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
>> completely for these processes, this scares many THPs in our system
>> because they could no longer get allocated where they used to be
>> allocated: regions flagged as VM_HUGEPAGE. Apparently, that poses a
>> problem for relevant workloads, because "no THPs" is certainly worse
>> than "THPs only when advised".
>>
>> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless
>> explicitly advised by the app through MADV_HUGEPAGE"? *maybe*, but this
>> would change the documented semantics quite a bit, and reduce its
>> versatility for debugging purposes, so I am not 100% sure that is what we
>> want -- although it would certainly be much easier.
>>
>> So instead, as an easy way forward for (3) and (4), add an option to
>> make PR_SET_THP_DISABLE disable *fewer* THPs for a process.
>>
>> In essence, this patch:
>>
>> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>>
>>     Previously, arg3 was not allowed to be set (-EINVAL). Now it holds
>>     flags.
>>
>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>     Previously, it would return 1 if THPs were disabled completely. Now
>>     it essentially returns the set flags as well.
>>
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>     the semantics clearly.
>>
>>     Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>     Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
>>     disabled completely
>>
>>     Only indicating that THPs are disabled when they are really disabled
>>     completely, not only partially.
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
>>
>> There is currently no way to prevent a process from issuing
>> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
>> to PR_SET_THP_DISABLE through another flag if ever required. The known
>> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
>> that is not added for now.
>>
>> Cc: Jonathan Corbet
>> Cc: Andrew Morton
>> Cc: Lorenzo Stoakes
>> Cc: Zi Yan
>> Cc: Baolin Wang
>> Cc: "Liam R. Howlett"
>> Cc: Nico Pache
>> Cc: Ryan Roberts
>> Cc: Dev Jain
>> Cc: Barry Song
>> Cc: Vlastimil Babka
>> Cc: Mike Rapoport
>> Cc: Suren Baghdasaryan
>> Cc: Michal Hocko
>> Cc: Usama Arif
>> Cc: SeongJae Park
>> Cc: Jann Horn
>> Cc: Liam R. Howlett
>> Cc: Yafang Shao
>> Cc: Matthew Wilcox
>> Signed-off-by: David Hildenbrand
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
>> also block MADV_COLLAPSE, which can be very helpful.
>> Of course, I thought
>> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default for VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
>>
>> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions".
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to, or store this thp information elsewhere.
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree, like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
>>
>> This is *completely* untested and might be utterly broken. It merely
>> serves as a PoC of what I think could be done. If this ever goes upstream,
>> we need some kselftests for it, and extensive tests.
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> ---
>>  Documentation/filesystems/proc.rst |  5 +--
>>  fs/proc/array.c                    |  2 +-
>>  include/linux/huge_mm.h            | 20 ++++++++---
>>  include/linux/mm_types.h           | 13 +++----
>>  include/uapi/linux/prctl.h         |  7 ++++
>>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>  mm/khugepaged.c                    |  2 +-
>>  7 files changed, 78 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
>> index 2971551b72353..915a3e44bc120 100644
>> --- a/Documentation/filesystems/proc.rst
>> +++ b/Documentation/filesystems/proc.rst
>> @@ -291,8 +291,9 @@ It's slow but very precise.
>>   HugetlbPages                size of hugetlb memory portions
>>   CoreDumping                 process's memory is currently being dumped
>>                               (killing the process may lead to a corrupted core)
>> - THP_enabled                 process is allowed to use THP (returns 0 when
>> -                             PR_SET_THP_DISABLE is set on the process
>> + THP_enabled                 process is allowed to use THP (returns 0 when
>> +                             PR_SET_THP_DISABLE is set on the process to disable
>> +                             THP completely, not just partially)
>>   Threads                     number of threads
>>   SigQ                        number of signals queued/max. number for queue
>>   SigPnd                      bitmap of pending signals for the thread
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index d6a0369caa931..c4f91a784104f 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>>
>>  	if (thp_enabled)
>> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
>> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>  }
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e0a27f80f390d..c4127104d9bc3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -323,16 +323,26 @@ struct thpsize {
>>  	(transparent_hugepage_flags &					\
>>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>
>> +/*
>> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
>> + * architectures may disable THP for some mappings, for example, s390 kvm.
>> + */
>>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>  		vm_flags_t vm_flags)
>>  {
>> +	/* Are THPs disabled for this VMA? */
>> +	if (vm_flags & VM_NOHUGEPAGE)
>> +		return true;
>> +	/* Are THPs disabled for all VMAs in the whole process? */
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
>> +		return true;
>>  	/*
>> -	 * Explicitly disabled through madvise or prctl, or some
>> -	 * architectures may disable THP for some mappings, for
>> -	 * example, s390 kvm.
>> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
>> +	 * advise to use them?
>>  	 */
>> -	return (vm_flags & VM_NOHUGEPAGE) ||
>> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		return false;
>> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>  }
>
>
> Hi David,

Hi!

>
> Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
> VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
> and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
> as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?

THPs are disabled for these regions, so it's at least consistent with
the "disable all", but ...

>
> This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
> hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
> which also ends up calling hugepage_vma_revalidate.
>
> A hacky way would be to save and overwrite vma->vm_flags with VM_HUGEPAGE at the start of madvise_collapse
> if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
> (Not something I am recommending, just throwing it out there).

Gah.

>
> Another possibility is to pass the fact that you are in madvise_collapse to these functions
> as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
> already has collapse control arg, so just need to take care of thp_vma_allowable_orders.

Likely this.

>
> Any preference or better suggestions?

What you are asking for is not MMF_DISABLE_THP_EXCEPT_ADVISED as I planned
it, but MMF_DISABLE_THP_EXCEPT_ADVISED_OR_MADV_COLLAPSE.

Now, one could consider MADV_COLLAPSE an "advise". (I am not opposed to
that change)

Indeed, the right way might be telling vma_thp_disabled() whether we are
in collapse.

Can you try implementing that on top of my patch to see how it looks?

-- 
Cheers,

David / dhildenb
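
For illustration only, the check order discussed in this thread can be
modelled in plain userspace C. This is a rough sketch under assumptions,
not kernel code: the flag values below are made-up masks (the kernel uses
bit numbers with test_bit()), and thp_disabled_model(), its in_collapse
parameter and thp_model_selfcheck() are hypothetical names. Treating
MADV_COLLAPSE as advice is the variant being discussed, not something the
posted patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel flags; values are illustrative. */
#define VM_HUGEPAGE			(1UL << 0) /* MADV_HUGEPAGE was applied */
#define VM_NOHUGEPAGE			(1UL << 1) /* MADV_NOHUGEPAGE was applied */
#define MMF_DISABLE_THP_COMPLETELY	(1UL << 0) /* modelled as a mask here */
#define MMF_DISABLE_THP_EXCEPT_ADVISED	(1UL << 1) /* modelled as a mask here */

/*
 * Model of the vma_thp_disabled() logic from the quoted hunk, extended
 * with an assumed in_collapse argument that counts MADV_COLLAPSE as
 * explicit advice. Returns true if THPs are disabled.
 */
bool thp_disabled_model(unsigned long vm_flags, unsigned long mm_flags,
			bool in_collapse)
{
	/* THPs disabled for this VMA? */
	if (vm_flags & VM_NOHUGEPAGE)
		return true;
	/* THPs disabled for the whole process, collapse included? */
	if (mm_flags & MMF_DISABLE_THP_COMPLETELY)
		return true;
	/* Explicit advice (or, in this variant, collapse) wins from here on. */
	if ((vm_flags & VM_HUGEPAGE) || in_collapse)
		return false;
	/* Otherwise THPs are disabled only if the partial disable is set. */
	return (mm_flags & MMF_DISABLE_THP_EXCEPT_ADVISED) != 0;
}

/* A few cases matching the behaviour described in the thread. */
void thp_model_selfcheck(void)
{
	/* Complete disable also blocks collapse on an advised VMA. */
	assert(thp_disabled_model(VM_HUGEPAGE, MMF_DISABLE_THP_COMPLETELY, true));
	/* Partial disable: unadvised VMA, no collapse -> disabled. */
	assert(thp_disabled_model(0, MMF_DISABLE_THP_EXCEPT_ADVISED, false));
	/* Partial disable: advised VMA -> allowed. */
	assert(!thp_disabled_model(VM_HUGEPAGE, MMF_DISABLE_THP_EXCEPT_ADVISED, false));
	/* Partial disable: collapse on an unadvised VMA -> allowed (variant). */
	assert(!thp_disabled_model(0, MMF_DISABLE_THP_EXCEPT_ADVISED, true));
	/* VM_NOHUGEPAGE always wins. */
	assert(thp_disabled_model(VM_NOHUGEPAGE, 0, true));
}
```

Passing in_collapse = false everywhere gives back the behaviour of the
quoted hunk, which makes the difference between the two variants easy to
compare.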