From: Libo Chen <libo.chen@oracle.com>
To: "Chen, Yu C" <yu.c.chen@intel.com>
Cc: "Juri Lelli" <juri.lelli@redhat.com>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Phil Auld" <pauld@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Daniel Jordan" <daniel.m.jordan@oracle.com>,
	"Jann Horn" <jannh@google.com>,
	"Pedro Falcato" <pfalcato@suse.de>,
	"Aubrey Li" <aubrey.li@intel.com>,
	"Tim Chen" <tim.c.chen@intel.com>,
	"Huang, Ying" <ying.huang@linux.alibaba.com>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, "Jonathan Corbet" <corbet@lwn.net>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Shakeel Butt" <shakeel.butt@linux.dev>
Subject: Re: [PATCH v2] sched/numa: Introduce per cgroup numa balance control
Date: Tue, 1 Jul 2025 10:58:38 -0700	[thread overview]
Message-ID: <82fb706f-b12d-4340-8f8c-6f949ab82913@oracle.com> (raw)
In-Reply-To: <b09bd5f9-d029-40a0-a853-2a90ef969854@intel.com>

Hi Chenyu,

On 7/1/25 09:36, Chen, Yu C wrote:
> Hi Libo,
> 
> On 7/1/2025 4:25 PM, Libo Chen wrote:
>> Hi Chenyu,
>>
>> Thanks for the patch. See my comments below.
>>
>> On 6/25/25 03:23, Chen Yu wrote:
>>> [Problem Statement]
>>> Currently, NUMA balancing is configured system-wide.
>>> However, in some production environments, different
>>> cgroups may have varying requirements for NUMA balancing.
>>> Some cgroups are CPU-intensive, while others are
>>> memory-intensive. Some do not benefit from NUMA balancing
>>> due to the overhead associated with VMA scanning, while
>>> others prefer NUMA balancing as it helps improve memory
>>> locality. In this case, system-wide NUMA balancing is
>>> usually disabled to avoid causing regressions.
>>>
>>> [Proposal]
>>> Introduce a per-cgroup interface to enable NUMA balancing
>>> for specific cgroups. This interface is associated with
>>> the CPU subsystem, which does not support threaded subtrees,
>>> and is close to CPU bandwidth control. The system administrator
>>> needs to set the NUMA balancing mode to
>>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the
>>> system is in NUMA_BALANCING_CGROUP mode, NUMA balancing
>>> for all cgroups is disabled by default. After the
>>> administrator enables this feature for a specific cgroup,
>>> NUMA balancing for that cgroup is enabled.
>>>
>>> A simple example to show how to use per-cgroup NUMA balancing:
>>>
>>> Step 1
>>> //Switch on per-cgroup NUMA balancing.
>>> //NUMA balancing for all cgroups is disabled by default.
>>> echo 4 > /proc/sys/kernel/numa_balancing
>>>
>>> Step 2
>>> //Create a cgroup named mytest and enable its NUMA balancing.
>>> echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
>>>
>>> [Benchmark]
>>> Tested on a Xeon Sapphire Rapids system with 4 NUMA nodes. Created
>>> a cgroup mytest and launched autonumabench NUMA01_THREADLOCAL.
>>> Each test runs 6 cycles.
>>>
>>> baseline:
>>> 29360.56user 16280.68system 3:33.29elapsed
>>> 29788.41user 16060.31system 3:34.38elapsed
>>> 28307.51user 17043.45system 3:33.03elapsed
>>> 29552.49user 16307.65system 3:34.20elapsed
>>> 29847.41user 15966.15system 3:34.65elapsed
>>> 29111.10user 16532.78system 3:33.19elapsed
>>>
>>> per cgroup NUMA balance:
>>> 7589.78user 16494.90system 1:53.18elapsed
>>> 7795.54user 16537.65system 1:54.11elapsed
>>> 8295.66user 16391.21system 1:55.98elapsed
>>> 7836.34user 17312.31system 1:55.71elapsed
>>> 7773.26user 16856.19system 1:54.08elapsed
>>> 7534.43user 17604.58system 1:55.01elapsed
>>>
>>> The user time has been reduced to 33% of the
>>> original, and the elapsed time has dropped to
>>> 45% of the original (lower values are better).
>>>
>>> cat /sys/fs/cgroup/mytest/memory.stat | grep numa
>>> numa_pages_migrated 10238503
>>> numa_pte_updates 24378124
>>> numa_hint_faults 16921590
>>> numa_task_migrated 253
>>> numa_task_swapped 4
>>>
>>> to-do:
>>> Per-cgroup NUMA balancing should consider the
>>> hierarchy of the cgroup. Initially, NUMA balancing
>>> is disabled for the root cgroup. A cgroup that has
>>> NUMA balancing enabled should have all its parents
>>> enabled. For example, suppose cgroup A is the parent
>>> of cgroup B; if A.numa_load_balance is 0, even if
>>> B.numa_load_balance is 1, NUMA balancing for B is
>>> disabled. This idea is derived from
>>> commit e39925734909 ("mm/memcontrol: respect
>>> zswap.writeback setting from parent cgroup too").
>>>
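
Regarding the to-do above: just to check that I read the hierarchy idea correctly, I guess
the check would end up as something like the sketch below (untested; the name is just for
illustration, and it assumes the existing tg->parent linkage plus the nlb_enabled field
introduced by this patch):

static bool tg_numa_balance_enabled_hier(struct task_group *tg)
{
	/* every ancestor up to the root must have NUMA balancing enabled */
	for (; tg; tg = tg->parent) {
		if (!READ_ONCE(tg->nlb_enabled))
			return false;
	}
	return true;
}
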
>>> Suggested-by: Tim Chen <tim.c.chen@intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> v1->v2:
>>>
>>> Add documentation in Documentation/admin-guide/sysctl/kernel.rst.
>>>
>>> Add comments in tg_numa_balance_enabled() to explain that
>>> the newly introduced NUMA balancing mode is naturally
>>> exclusive of existing NUMA balancing modes. (Tim)
>>> ---
>>>   Documentation/admin-guide/sysctl/kernel.rst |  6 ++++
>>>   include/linux/sched/sysctl.h                |  1 +
>>>   kernel/sched/core.c                         | 31 +++++++++++++++++++++
>>>   kernel/sched/fair.c                         | 28 +++++++++++++++++++
>>>   kernel/sched/sched.h                        |  3 ++
>>>   mm/mprotect.c                               |  5 ++--
>>>   6 files changed, 72 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>> index dd49a89a62d3..ff88d1153c19 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -709,6 +709,7 @@ The value to set can be the result of ORing the following:
>>>   0 NUMA_BALANCING_DISABLED
>>>   1 NUMA_BALANCING_NORMAL
>>>   2 NUMA_BALANCING_MEMORY_TIERING
>>> +4 NUMA_BALANCING_CGROUP
>>>   = =================================
>>>     Or NUMA_BALANCING_NORMAL to optimize page placement among different
>>> @@ -729,6 +730,11 @@ different types of memory (represented as different NUMA nodes) to
>>>   place the hot pages in the fast memory.  This is implemented based on
>>>   unmapping and page fault too.
>>>   +Or NUMA_BALANCING_CGROUP to enable the per cgroup NUMA balancing.
>>> +This new behavior can be opted-in/out on a per-cgroup basis via a new
>>> +cgroup CPU subsystem file named numa_load_balance. By default, per
>>> +cgroup NUMA balancing for each cgroup is enabled.
>>> +
>>>   numa_balancing_promote_rate_limit_MBps
>>>   ======================================
>>>   diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>>> index 5a64582b086b..1e4d5a9ddb26 100644
>>> --- a/include/linux/sched/sysctl.h
>>> +++ b/include/linux/sched/sysctl.h
>>> @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
>>>   #define NUMA_BALANCING_DISABLED        0x0
>>>   #define NUMA_BALANCING_NORMAL        0x1
>>>   #define NUMA_BALANCING_MEMORY_TIERING    0x2
>>> +#define NUMA_BALANCING_CGROUP        0x4
>>>     #ifdef CONFIG_NUMA_BALANCING
>>>   extern int sysctl_numa_balancing_mode;
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 8988d38d46a3..8e9aa59193df 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -10078,6 +10078,30 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>>>   }
>>>   #endif
>>>   +#ifdef CONFIG_NUMA_BALANCING
>>> +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
>>> +                  struct cftype *cftype, u64 enable)
>>> +{
>>> +    struct task_group *tg;
>>> +    bool was_enabled;
>>> +
>>> +    tg = css_tg(css);
>>> +    was_enabled = READ_ONCE(tg->nlb_enabled);
>>> +    if (was_enabled == enable)
>>> +        return 0;
>>> +
>>> +    WRITE_ONCE(tg->nlb_enabled, enable);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
>>> +                 struct cftype *cft)
>>> +{
>>> +    return READ_ONCE(css_tg(css)->nlb_enabled);
>>> +}
>>> +#endif /* CONFIG_NUMA_BALANCING */
>>> +
>>>   static struct cftype cpu_files[] = {
>>>   #ifdef CONFIG_GROUP_SCHED_WEIGHT
>>>       {
>>> @@ -10126,6 +10150,13 @@ static struct cftype cpu_files[] = {
>>>           .seq_show = cpu_uclamp_max_show,
>>>           .write = cpu_uclamp_max_write,
>>>       },
>>> +#endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +    {
>>> +        .name = "numa_load_balance",
>>> +        .read_u64 = numa_balance_read_u64,
>>> +        .write_u64 = numa_balance_write_u64,
>>> +    },
>>>   #endif
>>>       { }    /* terminate */
>>>   };
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a14da5396fb..dcdee3bf9960 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3161,6 +3161,29 @@ void task_numa_free(struct task_struct *p, bool final)
>>>       }
>>>   }
>>>   +/*
>>> + * Return true if the NUMA balance is allowed for
>>> + * the task in a task group.
>>> + */
>>> +static bool tg_numa_balance_enabled(struct task_struct *p)
>>> +{
>>> +    /*
>>> +     * The min/max of sysctl_numa_balancing_mode ranges
>>> +     * from SYSCTL_ONE to SYSCTL_FOUR, so it is safe to
>>> +     * only check NUMA_BALANCING_CGROUP because it is
>>> +     * impossible to have both NUMA_BALANCING_CGROUP and
>>> +     * NUMA_BALANCING_NORMAL/NUMA_BALANCING_MEMORY_TIERING
>>> +     * set.
>>> +     */
>>> +    struct task_group *tg = task_group(p);
>>> +
>>> +    if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
>>> +        !READ_ONCE(tg->nlb_enabled))
>>> +        return false;
>>> +
>>> +    return true;
>>> +}
>>> +
>>>   /*
>>>    * Got a PROT_NONE fault for a page on @node.
>>>    */
>>> @@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>>            !cpupid_valid(last_cpupid)))
>>>           return;
>>>   +    if (!tg_numa_balance_enabled(p))
>>> +        return;
>>> +
>>
>> I think this one may be redundant when you already have it in task_numa_work().  Without the
>> scanning, there won't be any hinting page faults on that task, so neither do_numa_page() nor
>> do_huge_pmd_numa_page() will be called. Though it's a minor issue if tg_numa_balance_enabled(p)
>> is fast.
>>
> 
> Previously I was thinking of the following sequence:
> 1. NUMA balancing is enabled and task_numa_work() is invoked;
>    pages are scanned and PROT_NONE is set.
> 2. cgroup NUMA balancing is disabled by the user.
> 3. do_numa_page() is triggered and PROT_NONE is cleared.
>    We don't want to do further task migration, so
>    task_numa_fault() bails out. (Page migration is still
>    allowed, as we mainly want to control the behavior of
>    the task.)
> 

Ah right, that makes sense. Should that check fall under unlikely() then? The timing window
seems quite small to me.
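
Something like this is what I had in mind, i.e. just annotating the check in your
task_numa_fault() hunk (untested, purely illustrative):

	/* the scan-then-disable window described above should be rare */
	if (unlikely(!tg_numa_balance_enabled(p)))
		return;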

>> Overall this is good. But more generally, I am thinking of something finer-grained, like per-task
>> NUMA balancing control with tunables at the task level (if possible); that would be much more
>> useful, at least for us. There are use cases for controlling NUMA balancing at the task level:
>> applications tuned for NUMA (that don't want NUMA balancing to mess with their task/memory
>> placements), such as databases, can be in the same cgroup as other untuned applications, or not
>> be in a cgroup at all. Right now we have to turn off NUMA balancing globally, but that's not
>> really optimal in a lot of cases. I do understand your use cases for per-cgroup control, but I
>> wonder if there is a way to nicely combine them. Per-task control should cover per-cgroup
>> control functionality-wise, but it's an inconvenient interface, as one has to set it for all
>> tasks of the same cgroup.
> 
> OK. Michal has also suggested using the per-task interface
> (prctl()/sched_setattr()) for NUMA balancing instead of per-cgroup
> control. In theory, I think it is doable. In a cloud environment,
> users can set the attribute (enable NUMA balancing) for the first
> process of a cgroup, and later child processes will inherit this
> attribute. But yes, when the admin decides to change this attribute,
> each process of the cgroup has to be iterated.
> 

A per-cgroup on/off button could be added to libcgroup or a similar tool for sysadmins, but
inside the kernel each task's attribute would still be set one by one. What I am more worried
about is cgroups that frequently have tasks join and leave; it would become very annoying to
set/unset the attribute at each entry and exit.
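
Just to spell out the inconvenience: with a purely per-task knob, userspace ends up doing
something like the sketch below for every member of the cgroup (hypothetical: as far as I
know there is no per-task NUMA balancing syscall upstream today, so the loop only prints
what it would do):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/mytest/cgroup.procs", "r");
	int pid;

	if (!f) {
		perror("cgroup.procs");
		return 1;
	}

	/*
	 * One call per current member, e.g. via a sched_setattr()
	 * extension or a prctl() run from inside each task. Tasks
	 * that fork into or migrate into the cgroup after this loop
	 * are missed and need to be handled separately.
	 */
	while (fscanf(f, "%d", &pid) == 1)
		printf("would enable NUMA balancing for pid %d\n", pid);

	fclose(f);
	return 0;
}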

>> I haven't thought too hard about
>> it yet; I just wanted to bring it up and see if we can work out something together.
>>
> 
> Sure, let me have a try at a per-task version, and we can
> discuss/collaborate on that.
> 

Cool~ Looking forward to it.

> thanks,
> Chenyu
> 
>> Thanks,
>> Libo
>>
>>>       /* Allocate buffer to track faults on a per-node basis */
>>>       if (unlikely(!p->numa_faults)) {
>>>           int size = sizeof(*p->numa_faults) *
>>> @@ -3330,6 +3356,8 @@ static void task_numa_work(struct callback_head *work)
>>>       if (p->flags & PF_EXITING)
>>>           return;
>>>   +    if (!tg_numa_balance_enabled(p))
>>> +        return;
>>>       /*
>>>        * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>>>        * no page can be migrated.
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 475bb5998295..4b0dc656688e 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -485,6 +485,9 @@ struct task_group {
>>>       /* Effective clamp values used for a task group */
>>>       struct uclamp_se    uclamp[UCLAMP_CNT];
>>>   #endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +    u64            nlb_enabled;
>>> +#endif
>>>     };
>>>   diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 88608d0dc2c2..c288ffb92bfc 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                   toptier = node_is_toptier(nid);
>>>                     /*
>>> -                 * Skip scanning top tier node if normal numa
>>> +                 * Skip scanning top tier node if normal and cgroup numa
>>>                    * balancing is disabled
>>>                    */
>>> -                if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>>> +                if (!(sysctl_numa_balancing_mode &
>>> +                    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
>>>                       toptier)
>>>                       continue;
>>>                   if (folio_use_access_time(folio))
>>


