From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="v4U3soXs" Received: from out-174.mta1.migadu.com (out-174.mta1.migadu.com [95.215.58.174]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2D84199 for ; Wed, 13 Dec 2023 09:26:00 -0800 (PST) Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1702488358; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=uqhYE0qRPS5MXnIouVC995+UobHLK6bjkNXqr+QueAM=; b=v4U3soXs9D53txzuFvc5w0/MH0FAHPtrpTzdtCQ3YottBrMDpzJPzd7P87/3SC2RfmAU/5 35FixLA/iXaYMMdUzlRWVXX1Y2aIhppcWU7xG6BVwAUt2ivfpChHn5qGuFyrSXbaQ+ZlWA ixelzo2U+iADre/J5FAgtfmn12QbHOU= Date: Wed, 13 Dec 2023 09:25:55 -0800 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH bpf-next 2/5] bpf: Allow per unit prefill for non-fix-size percpu memory allocator Content-Language: en-GB To: Hou Tao , bpf@vger.kernel.org Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , kernel-team@fb.com, Martin KaFai Lau References: <20231212223040.2135547-1-yonghong.song@linux.dev> <20231212223050.2137647-1-yonghong.song@linux.dev> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yonghong Song In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT On 12/13/23 3:03 AM, Hou Tao wrote: > Hi, > > On 12/13/2023 6:30 AM, Yonghong Song wrote: >> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") >> added support for non-fix-size percpu memory allocation. >> Such allocation will allocate percpu memory for all buckets on all >> cpus and the memory consumption is in the order to quadratic. >> For example, let us say, 4 cpus, unit size 16 bytes, so each >> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. >> Then let us say, 8 cpus with the same unit size, each cpu >> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. >> So if the number of cpus doubles, the number of memory consumption >> will be 4 times. So for a system with large number of cpus, the >> memory consumption goes up quickly with quadratic order. >> For example, for 4KB percpu allocation, 128 cpus. The total memory >> consumption will 4KB * 128 * 128 = 64MB. Things will become >> worse if the number of cpus is bigger (e.g., 512, 1024, etc.) >> >> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is >> done in boot time, so for system with large number of cpus, the initial >> percpu memory consumption is very visible. For example, for 128 cpu >> system, the total percpu memory allocation will be at least >> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) >> * 128 * 128 = ~138MB. >> which is pretty big. It will be even bigger for larger number of cpus. >> >> Note that the current prefill also allocates 4 entries if the unit size >> is less than 256. So on top of 138MB memory consumption, this will >> add more consumption with >> 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB. >> Next patch will try to reduce this memory consumption. >> >> Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory >> at init stage") moved the non-fix-size percpu memory allocation >> to bpf verificaiton stage. Once a particular bpf_percpu_obj_new() >> is called by bpf program, the memory allocator will try to fill in >> the cache with all sizes, causing the same amount of percpu memory >> consumption as in the boot stage. >> >> To reduce the initial percpu memory consumption for non-fix-size >> percpu memory allocation, instead of filling the cache with all >> supported allocation sizes, this patch intends to fill the cache >> only for the requested size. As typically users will not use large >> percpu data structure, this can save memory significantly. >> For example, the allocation size is 64 bytes with 128 cpus. >> Then total percpu memory amount will be 64 * 128 * 128 = 1MB, >> much less than previous 138MB. >> >> Signed-off-by: Yonghong Song >> --- >> include/linux/bpf_mem_alloc.h | 5 +++ >> kernel/bpf/memalloc.c | 62 +++++++++++++++++++++++++++++++++++ >> kernel/bpf/verifier.c | 23 +++++-------- >> 3 files changed, 75 insertions(+), 15 deletions(-) >> >> diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h >> index bb1223b21308..b049c580e7fb 100644 >> --- a/include/linux/bpf_mem_alloc.h >> +++ b/include/linux/bpf_mem_alloc.h >> @@ -21,8 +21,13 @@ struct bpf_mem_alloc { >> * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects. >> * Alloc and free are done with bpf_mem_{alloc,free}() and the size of >> * the returned object is given by the size argument of bpf_mem_alloc(). >> + * If percpu equals true, error will be returned in order to avoid >> + * large memory consumption and the below bpf_mem_alloc_percpu_unit_init() >> + * should be used to do on-demand per-cpu allocation for each size. >> */ >> int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu); >> +/* The percpu allocation is allowed for different unit size. */ >> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); >> void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); >> >> /* kmalloc/kfree equivalent: */ >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >> index 75068167e745..84987e97fd0a 100644 >> --- a/kernel/bpf/memalloc.c >> +++ b/kernel/bpf/memalloc.c >> @@ -526,6 +526,9 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) >> struct bpf_mem_cache *c, __percpu *pc; >> struct obj_cgroup *objcg = NULL; >> >> + if (percpu && size == 0) >> + return -EINVAL; >> + >> /* room for llist_node and per-cpu pointer */ >> if (percpu) >> percpu_size = LLIST_NODE_SZ + sizeof(void *); >> @@ -625,6 +628,65 @@ static void bpf_mem_alloc_destroy_cache(struct bpf_mem_cache *c) >> drain_mem_cache(c); >> } >> >> +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) >> +{ >> + static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; > Why duplicate the sizes array ? It is better to move it out of these > functions and share it between both bpf_mem_alloc_ini() and > bpf_mem_alloc_percpu_unit_init(). Good point. Will do in the next revision. > >> + int cpu, i, err, unit_size, percpu_size = 0; >> + struct bpf_mem_caches *cc, __percpu *pcc; >> + struct obj_cgroup *objcg = NULL; >> + struct bpf_mem_cache *c; >> + >> + /* room for llist_node and per-cpu pointer */ >> + percpu_size = LLIST_NODE_SZ + sizeof(void *); >> + >> + if (ma->caches) { >> + pcc = ma->caches; >> + } else { >> + ma->percpu = true; >> + pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL | __GFP_ZERO); >> + if (!pcc) >> + return -ENOMEM; >> + ma->caches = pcc; >> + } > It is a little weird to me that a single API does two things: > initialization and incremental refill. How about introducing two APIs to > reduce the memory usage of global per-cpu ma: one API to initialize the > global per-cpu ma in bpf_global_ma_init(), and another API to > incremental refill global per-cpu ma accordingly ? This can ineed to make semantics and code easy to understand. Will make the change in the next revision. >> + >> + err = 0; >> + i = bpf_mem_cache_idx(size + LLIST_NODE_SZ); >> + if (i < 0) { >> + err = -EINVAL; >> + goto out; >> + } >> + unit_size = sizes[i]; >> + [...]