All of lore.kernel.org
 help / color / mirror / Atom feed
From: Usama Arif <usamaarif642@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	david@redhat.com, linux-mm@kvack.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, riel@surriel.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
Date: Thu, 8 May 2025 17:04:37 +0100	[thread overview]
Message-ID: <ebfca8f2-40e5-485a-a060-621aa3a22376@gmail.com> (raw)
In-Reply-To: <CALOAHbCxhL=VM=E5UzNvQYZsrF4zdcQ1-49iEJ1UYvLsurtxCw@mail.gmail.com>



On 08/05/2025 06:41, Yafang Shao wrote:
> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 07/05/2025 16:57, Zi Yan wrote:
>>> On 7 May 2025, at 11:12, Usama Arif wrote:
>>>
>>>> On 07/05/2025 15:57, Zi Yan wrote:
>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> 
> Thanks
> 
>>>>>
>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>>>>
>>>>>> Allowing override of global THP policy per process allows workloads
>>>>>> that have shown to benefit from hugepages to do so, without regressing
>>>>>> workloads that wouldn't benefit. This will allow such types of
>>>>>> workloads to be run/stacked on the same machine.
>>>>>>
>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
>>>>>> for workloads that benefit from them, where a single THP policy is
>>>>>> likely to be used across the entire fleet, and prctl will help override it.
>>>>>>
>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>>>>>> that this will work even when there are no cgroups present, and my
>>>>>> understanding is there is a strong preference of cgroups controls being
>>>>>> hierarchical which usually means them having a numerical value.
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>> Do you mind giving an example on how to change THP policy for a set of
>>>>> processes running in a container (under a cgroup)?
>>>>
>>>> Hi Zi,
>>>>
>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
>>>> in exec-invoke.
>>>> This is at the start of the process, but you would already know at the start of the process
>>>> whether you want THP=always for it or not.
>>>>
>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>>>
>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
>>> pass the THP enablement or disablement info from a systemd config file.
>>> And if you find those processes do not benefit from using THPs,
>>> you can just change the new "MemoryTHP" config and restart the processes.
>>>
>>> Am I getting it? Thanks.
>>>
>>
>> Yes, thats right. They would exactly the same as what we (Meta) do
>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
>> that is set, the prctl will be called at exec_invoke of the process [3].
>>
>> The systemd changes should be quite simple to do.
>>
>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> 
> This solution carries a risk: since prctl() does not require any
> capabilities, the task itself could call it and override your memory
> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> capability is typically enabled by default in containers, leaving them
> still vulnerable.
> 
> This approach might work for Kubernetes/container environments, but it
> would require substantial code changes to implement securely.
> 

You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
already exists and the someone could use this to slow the process down. So the
approach this patch takes shouldn't be anymore of a security fix then what is already
exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
should be used to restrict this. 

In terms of security vulnerability of prctl, I feel like there are a lot of others
that can be a much much bigger issue? I just had a look and you can change the
seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security
argument would be valid.




  reply	other threads:[~2025-05-08 16:04 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif
2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
2025-05-07 15:02   ` Usama Arif
2025-05-07 20:14   ` Zi Yan
2025-05-08 10:53     ` Usama Arif
2025-05-08 20:29       ` Zi Yan
2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan
2025-05-07 15:12   ` Usama Arif
2025-05-07 15:57     ` Zi Yan
2025-05-07 16:09       ` Usama Arif
2025-05-08  5:41         ` Yafang Shao
2025-05-08 16:04           ` Usama Arif [this message]
2025-05-09  2:15             ` Yafang Shao
2025-05-09  5:13               ` Johannes Weiner
2025-05-09  9:24                 ` Yafang Shao
2025-05-09  9:30                   ` David Hildenbrand
2025-05-09  9:43                     ` Yafang Shao
2025-05-09 16:46                       ` Johannes Weiner
2025-05-09 22:42                         ` David Hildenbrand
2025-05-09 23:34                           ` Zi Yan
2025-05-11  8:15                             ` David Hildenbrand
2025-05-11 14:08                               ` Usama Arif
2025-05-13 11:43                                 ` Yafang Shao
2025-05-13 12:04                                 ` David Hildenbrand
2025-05-11  2:08                         ` Yafang Shao
2025-05-08 11:06 ` David Hildenbrand
2025-05-08 16:35   ` Usama Arif
2025-05-08 17:39     ` David Hildenbrand
2025-05-08 18:05       ` Usama Arif

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ebfca8f2-40e5-485a-a060-621aa3a22376@gmail.com \
    --to=usamaarif642@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=laoar.shao@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=npache@redhat.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.