All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yonghong Song <yonghong.song@linux.dev>
To: Hou Tao <houtao@huaweicloud.com>, bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v3 2/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
Date: Sun, 17 Dec 2023 09:21:30 -0800	[thread overview]
Message-ID: <9bf96304-2cfe-453c-a709-00eb56fdf136@linux.dev> (raw)
In-Reply-To: <5c13f568-325c-4e5c-9f9e-ca5da5c2c75b@linux.dev>


On 12/16/23 11:11 PM, Yonghong Song wrote:
>
> On 12/15/23 7:12 PM, Hou Tao wrote:
>> Hi,
>>
>> On 12/16/2023 10:30 AM, Yonghong Song wrote:
>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem 
>>> allocation")
>>> added support for non-fix-size percpu memory allocation.
>>> Such allocation will allocate percpu memory for all buckets on all
>>> cpus and the memory consumption is in the order to quadratic.
>>> For example, let us say, 4 cpus, unit size 16 bytes, so each
>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 
>>> bytes.
>>> Then let us say, 8 cpus with the same unit size, each cpu
>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 
>>> bytes.
>>> So if the number of cpus doubles, the number of memory consumption
>>> will be 4 times. So for a system with large number of cpus, the
>>> memory consumption goes up quickly with quadratic order.
>>> For example, for 4KB percpu allocation, 128 cpus. The total memory
>>> consumption will 4KB * 128 * 128 = 64MB. Things will become
>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
>> SNIP
>>> +__init int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
>>> +{
>>> +    struct bpf_mem_caches __percpu *pcc;
>>> +
>>> +    pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, 
>>> GFP_KERNEL);
>>> +    if (!pcc)
>>> +        return -ENOMEM;
>>> +
>>> +    ma->caches = pcc;
>>> +    ma->percpu = true;
>>> +    return 0;
>>> +}
>>> +
>>> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
>>> +{
>>> +    int cpu, i, err = 0, unit_size, percpu_size;
>>> +    struct bpf_mem_caches *cc, __percpu *pcc;
>>> +    struct obj_cgroup *objcg;
>>> +    struct bpf_mem_cache *c;
>>> +
>>> +    i = bpf_mem_cache_idx(size);
>>> +    if (i < 0)
>>> +        return -EINVAL;
>>> +
>>> +    /* room for llist_node and per-cpu pointer */
>>> +    percpu_size = LLIST_NODE_SZ + sizeof(void *);
>>> +
>>> +    pcc = ma->caches;
>>> +    unit_size = sizes[i];
>>> +
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> +    objcg = get_obj_cgroup_from_current();
>>> +#endif
>> For bpf_global_percpu_ma, we also need to account the allocated memory
>> to root memory cgroup just like bpf_global_ma did, do we ? So it seems
>> that we need to initialize c->objcg early in 
>> bpf_mem_alloc_percpu_init ().
>
> Good point. Agree. the original behavior percpu non-fix-size mem
> allocation is to do get_obj_cgroup_from_current() at init stage
> and charge to root memory cgroup, and we indeed should move
> the above bpf_mem_alloc_percpu_init().
>
>>> +    for_each_possible_cpu(cpu) {
>>> +        cc = per_cpu_ptr(pcc, cpu);
>>> +        c = &cc->cache[i];
>>> +        if (cpu == 0 && c->unit_size)
>>> +            goto out;
>>> +
>>> +        c->unit_size = unit_size;
>>> +        c->objcg = objcg;
>>> +        c->percpu_size = percpu_size;
>>> +        c->tgt = c;
>>> +
>>> +        init_refill_work(c);
>>> +        prefill_mem_cache(c, cpu);
>>> +
>>> +        if (cpu == 0) {
>>> +            err = check_obj_size(c, i);
>>> +            if (err) {
>>> +                drain_mem_cache(c);
>>> +                memset(c, 0, sizeof(*c));
>> I also forgot about c->objcg. objcg may be leaked if we do memset() 
>> here.
>
> The objcg gets a reference at init bpf_mem_alloc_init() stage
> and released at bpf_mem_alloc_destroy(). For bpf_global_ma,
> if there is a failure, indeed bpf_mem_alloc_destroy() will be
> called and the reference c->objcg will be released.
>
> So if we move get_obj_cgroup_from_current() to
> bpf_mem_alloc_percpu_init() stage, we should be okay here.
>
> BTW, is check_obj_size() really necessary here? My answer is no
> since as you mentioned, the size->cache_index is pretty stable,
> so check_obj_size() should not return error in such cases.
> What do you think?

How about the following change on top of this patch?

diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 43e635c67150..d1403204379e 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -11,6 +11,7 @@ struct bpf_mem_caches;
  struct bpf_mem_alloc {
         struct bpf_mem_caches __percpu *caches;
         struct bpf_mem_cache __percpu *cache;
+       struct obj_cgroup *objcg;
         bool percpu;
         struct work_struct work;
  };
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 5cf2738c20a9..6486da4ba097 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -553,6 +553,8 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
                 if (memcg_bpf_enabled())
                         objcg = get_obj_cgroup_from_current();
  #endif
+               ma->objcg = objcg;
+
                 for_each_possible_cpu(cpu) {
                         c = per_cpu_ptr(pc, cpu);
                         c->unit_size = unit_size;
@@ -573,6 +575,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
  #ifdef CONFIG_MEMCG_KMEM
         objcg = get_obj_cgroup_from_current();
  #endif
+       ma->objcg = objcg;
         for_each_possible_cpu(cpu) {
                 cc = per_cpu_ptr(pcc, cpu);
                 for (i = 0; i < NUM_CACHES; i++) {
@@ -637,6 +640,12 @@ __init int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma)
  
         ma->caches = pcc;
         ma->percpu = true;
+
+#ifdef CONFIG_MEMCG_KMEM
+       ma->objcg = get_obj_cgroup_from_current();
+#else
+       ma->objcg = NULL;
+#endif
         return 0;
  }

@@ -656,10 +665,8 @@ int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size)
  
         pcc = ma->caches;
         unit_size = sizes[i];
+       objcg = ma->objcg;
  
-#ifdef CONFIG_MEMCG_KMEM
-       objcg = get_obj_cgroup_from_current();
-#endif
         for_each_possible_cpu(cpu) {
                 cc = per_cpu_ptr(pcc, cpu);
                 c = &cc->cache[i];
@@ -799,9 +806,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
                         rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
                         rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
                 }
-               /* objcg is the same across cpus */
-               if (c->objcg)
-                       obj_cgroup_put(c->objcg);
+               if (ma->objcg)
+                       obj_cgroup_put(ma->objcg);
                 destroy_mem_alloc(ma, rcu_in_progress);
         }
         if (ma->caches) {
@@ -817,8 +823,8 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
                                 rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
                         }
                 }
-               if (c->objcg)
-                       obj_cgroup_put(c->objcg);
+               if (ma->objcg)
+                       obj_cgroup_put(ma->objcg);
                 destroy_mem_alloc(ma, rcu_in_progress);
         }
  }

I still think check_obj_size for percpu allocation is not needed.
But I guess we can address that issue later on.

>
>>> +                goto out;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +out:
>>> +    return err;
>>> +}
>>> +
>> .
>>
>

  reply	other threads:[~2023-12-17 17:21 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-16  2:30 [PATCH bpf-next v3 0/6] bpf: Reduce memory usage for bpf_global_percpu_ma Yonghong Song
2023-12-16  2:30 ` [PATCH bpf-next v3 1/6] bpf: Avoid unnecessary extra percpu memory allocation Yonghong Song
2023-12-16  2:30 ` [PATCH bpf-next v3 2/6] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Yonghong Song
2023-12-16  3:12   ` Hou Tao
2023-12-17  7:11     ` Yonghong Song
2023-12-17 17:21       ` Yonghong Song [this message]
2023-12-18  1:15         ` Hou Tao
2023-12-16 15:27   ` kernel test robot
2023-12-16  2:30 ` [PATCH bpf-next v3 3/6] bpf: Refill only one percpu element in memalloc Yonghong Song
2023-12-16  3:13   ` Hou Tao
2023-12-16  2:30 ` [PATCH bpf-next v3 4/6] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation Yonghong Song
2023-12-16  4:05   ` Hou Tao
2023-12-16  2:30 ` [PATCH bpf-next v3 5/6] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma Yonghong Song
2023-12-16  4:04   ` Hou Tao
2023-12-16  2:30 ` [PATCH bpf-next v3 6/6] selftests/bpf: Add a selftest with > 512-byte percpu allocation size Yonghong Song
2023-12-16  4:07   ` Hou Tao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9bf96304-2cfe-453c-a709-00eb56fdf136@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=houtao@huaweicloud.com \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.