From: Chengming Zhou <chengming.zhou@linux.dev>
To: Ming Yang <yangming73@huawei.com>,
cl@linux.com, penberg@kernel.org, rientjes@google.com,
iamjoonsoo.kim@lge.com, akpm@linux-foundation.org,
vbabka@suse.cz, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: zhangliang5@huawei.com, wangzhigang17@huawei.com,
liushixin2@huawei.com, alex.chen@huawei.com,
pengyi.pengyi@huawei.com, xiqi2@huawei.com
Subject: Re: [PATCH] slub: fix slub segmentation
Date: Tue, 2 Apr 2024 11:45:19 +0800
Message-ID: <cd42083e-ea53-48bd-aa32-a16fc9f73ffa@linux.dev>
In-Reply-To: <20240402031025.1097-1-yangming73@huawei.com>
On 2024/4/2 11:10, Ming Yang wrote:
> When one of the NUMA nodes runs out of memory while many processes are
> still booting, slabinfo shows that much slub segmentation exists. The
> following shows some of it:
>
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kmalloc-512 84309 380800 1024 32 8 : tunables 0 0 0 : slabdata 11900 11900 0
> kmalloc-256 65869 365408 512 32 4 : tunables 0 0 0 : slabdata 11419 11419 0
>
> 365408 "kmalloc-256" objects are allocated but only 65869 of them are
> used, while 380800 "kmalloc-512" objects are allocated but only 84309
> of them are used.
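
To put the quoted numbers in proportion, here is a minimal user-space
sketch that computes how many allocated objects sit idle and what share
is in use. The struct and function names (slab_row, idle_objs,
utilization_pct) are made up for this illustration and are not kernel
or procfs APIs; the figures are the ones quoted above.

```c
#include <assert.h>

/* One /proc/slabinfo row, reduced to the columns used here.
 * Hypothetical helper type, not a kernel structure. */
struct slab_row {
	const char *name;
	unsigned long active_objs;	/* objects currently in use */
	unsigned long num_objs;		/* objects allocated in total */
	unsigned long objperslab;	/* objects per slab */
	unsigned long num_slabs;	/* slabs allocated in total */
};

/* Number of objects that are allocated but not in use. */
unsigned long idle_objs(const struct slab_row *r)
{
	assert(r->active_objs <= r->num_objs);
	return r->num_objs - r->active_objs;
}

/* Share of allocated objects that are in use, in percent. */
double utilization_pct(const struct slab_row *r)
{
	if (r->num_objs == 0)
		return 0.0;
	return 100.0 * (double)r->active_objs / (double)r->num_objs;
}
```

With the quoted rows, kmalloc-256 has 299539 idle objects (roughly 18%
in use) and kmalloc-512 has 296491 idle objects (roughly 22% in use).
In both rows num_slabs * objperslab equals num_objs, so the idle
objects are spread over slabs that cannot be returned to the buddy
system while any object in them is still live.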
>
> This problem exists in the following scenario:
> 1. Multiple NUMA nodes, e.g. four nodes.
> 2. Lack of memory in any one node.
> 3. Functions which allocate a lot of slub memory on specific NUMA nodes,
> like alloc_fair_sched_group.
>
> The slub segmentation is generated for the following reason:
> In function "___slab_alloc", a new slab is first requested via
> function "get_partial". If the argument 'node' is assigned but there
> is neither partial memory nor buddy memory in that assigned node, no
> slab can be obtained. The program then attempts to allocate a new slab
> from the buddy system. As mentioned before, no buddy memory is left in
> the assigned node, so a new slab may be allocated directly from the
> buddy system of another node, regardless of whether free partial
> memory is left on other nodes. As a result, slub segmentation is
> generated.
>
> The key point of the above allocation flow is: the slab should be
> allocated from the partial lists of other nodes first, instead of
> directly from the buddy system of other nodes.
>
> In this commit a new slub allocation flow is proposed:
> 1. Attempt to get a slab via function get_partial (first step in
> new_objects label).
> 2. If no slab is obtained and 'node' is assigned, try to allocate a new
> slab from the assigned node only, instead of from all nodes.
> 3. If no slab can be allocated from the assigned node, try to allocate
> a slab from the partial lists of other nodes.
> 4. If the allocation in step 3 fails, allocate a new slab from the
> buddy system of all nodes.
FYI, there is another patch addressing the very same problem:
https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/
>
> Signed-off-by: Ming Yang <yangming73@huawei.com>
> Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
> Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
> Reviewed-by: Shixin Liu <liushixin2@huawei.com>
> ---
> This patch can be tested and verified with the following steps:
> 1. First, exhaust the memory on node0: echo 1000 (adjust to your memory size) >
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
> 2. Second, start 10000 (adjust to your memory size) processes that use the
> setsid system call, as the setsid system call is likely to call function
> alloc_fair_sched_group.
> 3. Last, check slabinfo: cat /proc/slabinfo.
>
> Hardware info:
> Memory : 8GiB
> CPU (total #): 120
> numa node: 4
>
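> Test C code example:
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(void) {
> 	void *p = malloc(1024);	/* hold some heap memory */
> 	setsid();		/* may call alloc_fair_sched_group */
> 	while (1);		/* keep the process alive */
> }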
>
> mm/slub.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7..3eb2e7d386 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	}
>
>  	slub_put_cpu_ptr(s->cpu_slab);
> +	if (node != NUMA_NO_NODE) {
> +		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
> +		if (slab)
> +			goto slab_alloced;
> +
> +		slab = get_any_partial(s, &pc);
> +		if (slab)
> +			goto slab_alloced;
> +	}
>  	slab = new_slab(s, gfpflags, node);
> +
> +slab_alloced:
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>
>  	if (unlikely(!slab)) {
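
For readers following along, the ordering described in steps 1-4 of the
quoted commit message can be compared with the flow it replaces using a
small user-space toy model. This is only a sketch under stated
assumptions: node_state, alloc_slab_model, NO_NODE and the SRC_*
constants are invented for the illustration, each node is reduced to
two counters, and none of this is the mm/slub.c implementation.

```c
#include <stdbool.h>

#define NR_NODES 4
#define NO_NODE (-1)	/* stands in for NUMA_NO_NODE in this model */

/* Toy per-node state: partially used slabs and free buddy pages. */
struct node_state {
	int partial_slabs;
	int free_pages;
};

enum source { SRC_NONE, SRC_PARTIAL, SRC_BUDDY };

struct result {
	enum source src;	/* where the slab came from */
	int node;		/* which node supplied it */
};

/*
 * proposed == false: flow described as the current behaviour
 *                    (step 1, then buddy system of any node).
 * proposed == true:  flow described in steps 1-4 of the commit message.
 */
struct result alloc_slab_model(struct node_state *nodes, int node,
			       bool proposed)
{
	int start = (node == NO_NODE) ? 0 : node;
	int k, i;

	if (node != NO_NODE) {
		/* Step 1: partial slab on the requested node. */
		if (nodes[node].partial_slabs > 0) {
			nodes[node].partial_slabs--;
			return (struct result){ SRC_PARTIAL, node };
		}
		/* Step 2: new slab from the requested node only. */
		if (proposed && nodes[node].free_pages > 0) {
			nodes[node].free_pages--;
			return (struct result){ SRC_BUDDY, node };
		}
	}

	/* Step 3: partial slab on any node (always tried without a node). */
	if (proposed || node == NO_NODE) {
		for (i = 0; i < NR_NODES; i++) {
			if (nodes[i].partial_slabs > 0) {
				nodes[i].partial_slabs--;
				return (struct result){ SRC_PARTIAL, i };
			}
		}
	}

	/* Step 4: new slab from any node, requested node first. */
	for (k = 0; k < NR_NODES; k++) {
		i = (start + k) % NR_NODES;
		if (nodes[i].free_pages > 0) {
			nodes[i].free_pages--;
			return (struct result){ SRC_BUDDY, i };
		}
	}
	return (struct result){ SRC_NONE, NO_NODE };
}
```

In this model, with node 0 exhausted and node 1 holding both partial
slabs and free pages, a request for node 0 takes a fresh page from
node 1 under the old ordering, but reuses one of node 1's partial slabs
under the proposed ordering.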