From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 23 Apr 2026 08:44:45 +0000
From: Matt Bobrowski
To: Emil Tsalapatis
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org, memxor@gmail.com,
 daniel@iogearbox.net, eddyz87@gmail.com, song@kernel.org
Subject: Re: [PATCH bpf-next v8 6/8] selftests/bpf: Add buddy allocator for libarena
Message-ID:
References: <20260421165037.4736-1-emil@etsalapatis.com>
 <20260421165037.4736-7-emil@etsalapatis.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260421165037.4736-7-emil@etsalapatis.com>

On Tue, Apr 21, 2026 at 12:50:35PM -0400, Emil Tsalapatis wrote:
> Add a byte-oriented buddy allocator for libarena. The buddy allocator
> provides an alloc/free interface for small arena allocations ranging
> from 16 bytes to 512 KiB. Smaller allocation sizes are rounded up to
> 16 bytes. The buddy allocator does not handle larger allocations,
> which can instead use the existing bpf_arena_{alloc, free}_pages()
> kfuncs.
>
> Signed-off-by: Emil Tsalapatis

The implementation of this BPF arena-backed buddy allocator looks rather
solid. I did notice that a single global lock is used to synchronize all
modifications to the allocator metadata. Have you thought about how this
might become a contention point, particularly on systems with very high
CPU core counts? Did you consider placing per-CPU caches in front of the
locked path for the most common/small allocation sizes, or something
along those lines? This is something that crossed my mind when I was
initially thinking about implementing something like this a while back.
I understand that this is an initial implementation, so it doesn't
necessarily need to carry all the bells and whistles, but I was curious
whether this had also crossed your mind and what your thoughts were.
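Purely to make the question concrete, the kind of per-CPU front end I
have in mind would look roughly like the below, presumably living next
to the rest of the allocator in buddy.bpf.c. This is an untested
sketch, none of these names exist in your series, and it glosses over
migration between reading the CPU id and touching the cache:

	#define PCPU_CACHE_ORDERS 4	/* front the buddy for orders 0-3, i.e. 16-128 bytes */
	#define PCPU_CACHE_DEPTH 16	/* cached blocks per order, per CPU */

	struct pcpu_cache {
		u64 slots[PCPU_CACHE_ORDERS][PCPU_CACHE_DEPTH];
		u32 count[PCPU_CACHE_ORDERS];
	};

	/* Hypothetical: one entry per possible CPU, carved out of the arena at init time. */
	struct pcpu_cache __arena *pcpu_caches;

	static u64 buddy_alloc_cached(buddy_t *buddy, size_t size)
	{
		struct pcpu_cache __arena *cache;
		u64 order = size_to_order(size);

		cache = &pcpu_caches[bpf_get_smp_processor_id()];

		/* Fast path: pop a block off the local magazine, no shared lock taken. */
		if (order < PCPU_CACHE_ORDERS && cache->count[order] > 0)
			return cache->slots[order][--cache->count[order]];

		/* Slow path: fall back to the regular locked buddy allocator. */
		return buddy_alloc_internal(buddy, size);
	}

The refill/flush policy, preemption between the
bpf_get_smp_processor_id() call and the array accesses, and how frees
find their way back to the owning CPU would all need actual thought;
the sketch is only meant to frame the question.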
> --- > tools/testing/selftests/bpf/libarena/Makefile | 2 + > .../selftests/bpf/libarena/include/buddy.h | 92 ++ > .../selftests/bpf/libarena/include/common.h | 14 + > .../bpf/libarena/selftests/selftest.c | 1 + > .../selftests/bpf/libarena/src/buddy.bpf.c | 903 ++++++++++++++++++ > .../selftests/bpf/libarena/src/common.bpf.c | 23 + > 6 files changed, 1035 insertions(+) > create mode 100644 tools/testing/selftests/bpf/libarena/include/buddy.h > create mode 100644 tools/testing/selftests/bpf/libarena/src/buddy.bpf.c > > diff --git a/tools/testing/selftests/bpf/libarena/Makefile b/tools/testing/selftests/bpf/libarena/Makefile > index cbd23c928e53..4fe048ef57e9 100644 > --- a/tools/testing/selftests/bpf/libarena/Makefile > +++ b/tools/testing/selftests/bpf/libarena/Makefile > @@ -51,6 +51,8 @@ ASAN_FLAGS += -mllvm -asan-destructor-kind=none > override BPF_CFLAGS += -DENABLE_ATOMICS_TESTS > override BPF_CFLAGS += -O2 -g > override BPF_CFLAGS += -Wno-incompatible-pointer-types-discards-qualifiers > +# Required to define our own arena-based free() > +override BPF_CFLAGS += -Wno-incompatible-library-redeclaration > # Required for suppressing harmless vmlinux.h-related warnings. > override BPF_CFLAGS += -Wno-missing-declarations > override BPF_CFLAGS += $(INCLUDES) > diff --git a/tools/testing/selftests/bpf/libarena/include/buddy.h b/tools/testing/selftests/bpf/libarena/include/buddy.h > new file mode 100644 > index 000000000000..00e2437128ef > --- /dev/null > +++ b/tools/testing/selftests/bpf/libarena/include/buddy.h > @@ -0,0 +1,92 @@ > +// SPDX-License-Identifier: LGPL-2.1 OR BSD-2-Clause > +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */ > +#pragma once > + > +struct buddy_chunk; > +typedef struct buddy_chunk __arena buddy_chunk_t; > + > +struct buddy_header; > +typedef struct buddy_header __arena buddy_header_t; > + > +enum buddy_consts { > + /* > + * Minimum allocation is 1 << BUDDY_MIN_ALLOC_SHIFT. > + * Larger sizes increase internal fragmentation, but smaller > + * sizes increase the space overhead of the block metadata. > + */ > + BUDDY_MIN_ALLOC_SHIFT = 4, > + BUDDY_MIN_ALLOC_BYTES = 1 << BUDDY_MIN_ALLOC_SHIFT, > + > + /* > + * How many orders the buddy allocator can serve. Minimum block > + * size is 1 << BUDDY_MIN_ALLOC_SHIFT, maximum block size is > + * 1 << (BUDDY_MIN_ALLOC_SHIFT + BUDDY_CHUNK_NUM_ORDERS - 1): > + * Each block has size 1 << BUDDY_MIN_ALLOC_SHIFT, and the > + * allocation orders are in [0, BUDDY_CHUNK_NUM_ORDERS). > + * We keep two blocks of the maximum size to retain the > + * property in the code that all blocks have a buddy. > + * Higher values increase the maximum allocation size, > + * but also the size of the metadata for each block. > + */ > + BUDDY_CHUNK_NUM_ORDERS = 1 << 4, > + BUDDY_CHUNK_BYTES = BUDDY_MIN_ALLOC_BYTES << (BUDDY_CHUNK_NUM_ORDERS), > + > + /* Offset of the buddy header within a free block, see buddy.bpf.c for details */ > + BUDDY_HEADER_OFF = 8, > + > + /* The maximum number of blocks a chunk may have to track. */ > + BUDDY_CHUNK_ITEMS = 1 << (BUDDY_CHUNK_NUM_ORDERS), > + BUDDY_CHUNK_OFFSET_MASK = BUDDY_CHUNK_BYTES - 1, > + > + /* > + * Alignment for chunk allocations based on bpf_arena_alloc_pages. > + * The arena allocation kfunc does not have an alignment argument, > + * but that is required for all block calculations in the chunk to > + * work. > + */ > + BUDDY_VADDR_OFFSET = BUDDY_CHUNK_BYTES, > + > + /* Total arena virtual address space the allocator can consume. 
*/ > + BUDDY_VADDR_SIZE = BUDDY_CHUNK_BYTES << 10 > +}; > + > +struct buddy_header { > + u32 prev_index; /* "Pointer" to the previous available allocation of the same size. */ > + u32 next_index; /* Same for the next allocation. */ > +}; > + > +/* > + * We bring memory into the allocator 1 MiB at a time. > + */ > +struct buddy_chunk { > + /* The order of the current allocation for an item. 4 bits per order. */ > + u8 orders[BUDDY_CHUNK_ITEMS / 2]; > + /* > + * Bit to denote whether a block is allocated. Size of the allocated/free > + * block is found from the orders array. > + */ > + u8 allocated[BUDDY_CHUNK_ITEMS / 8]; > + /* Freelists for O(1) allocation. */ > + u64 freelists[BUDDY_CHUNK_NUM_ORDERS]; > + buddy_chunk_t *next; > +}; > + > +struct buddy { > + buddy_chunk_t *first_chunk; /* Pointer to the chunk linked list. */ > + arena_spinlock_t lock; /* Allocator lock */ > + u64 vaddr; /* Allocation into reserved vaddr */ > +}; > + > +typedef struct buddy __arena buddy_t; > + > +#ifdef __BPF__ > + > +int buddy_init(buddy_t *buddy); > +int buddy_destroy(buddy_t *buddy); > +int buddy_free_internal(buddy_t *buddy, u64 free); > +#define buddy_free(buddy, ptr) buddy_free_internal((buddy), (u64)(ptr)) > +u64 buddy_alloc_internal(buddy_t *buddy, size_t size); > +#define buddy_alloc(alloc, size) ((void __arena *)buddy_alloc_internal((alloc), (size))) > + > + > +#endif /* __BPF__ */ > diff --git a/tools/testing/selftests/bpf/libarena/include/common.h b/tools/testing/selftests/bpf/libarena/include/common.h > index 7b2a91b85978..31c73b2db5bd 100644 > --- a/tools/testing/selftests/bpf/libarena/include/common.h > +++ b/tools/testing/selftests/bpf/libarena/include/common.h > @@ -42,6 +42,20 @@ extern volatile u64 asan_violated; > > int arena_fls(__u64 word); > > +u64 malloc_internal(size_t size); > +#define malloc(size) ((void __arena *)malloc_internal((size))) > +void free(void __arena *ptr); > + > +/* > + * The verifier associates arenas with programs by checking LD.IMM > + * instruction operands for an arena and populating the program state > + * with the first instance it finds. This requires accessing our global > + * arena variable, but subprogs do not necessarily do so while still > + * using pointers from that arena. Insert an LD.IMM instruction to > + * access the arena and help the verifier. > + */ > +#define arena_subprog_init() do { asm volatile ("" :: "r"(&arena)); } while (0) > + > #else /* ! __BPF__ */ > > #include > diff --git a/tools/testing/selftests/bpf/libarena/selftests/selftest.c b/tools/testing/selftests/bpf/libarena/selftests/selftest.c > index d1377d8073ef..4c95c36aa097 100644 > --- a/tools/testing/selftests/bpf/libarena/selftests/selftest.c > +++ b/tools/testing/selftests/bpf/libarena/selftests/selftest.c > @@ -17,6 +17,7 @@ > > #include > #include > +#include > #include > > #ifdef BPF_ARENA_ASAN > diff --git a/tools/testing/selftests/bpf/libarena/src/buddy.bpf.c b/tools/testing/selftests/bpf/libarena/src/buddy.bpf.c > new file mode 100644 > index 000000000000..489b7383a53a > --- /dev/null > +++ b/tools/testing/selftests/bpf/libarena/src/buddy.bpf.c > @@ -0,0 +1,903 @@ > +// SPDX-License-Identifier: LGPL-2.1 OR BSD-2-Clause > +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */ > + > +#include > +#include > +#include > + > +/* > + * Buddy allocator arena-based implementation. > + * > + * Memory is organized into chunks. These chunks > + * cannot be coalesced or split. Allocating > + * chunks allocates their memory eagerly.
> + * > + * Internally, each chunk is organized into blocks. > + * Blocks _can_ be coalesced/split, but only inside > + * the chunk. Each block can be allocated or > + * unallocated. If allocated, the entire block holds > + * user data. If unallocated, the block is mostly > + * invalid memory, with the exception of a header > + * used for freelist tracking. > + * > + * The header is placed at an offset inside the block > + * to prevent off-by-one errors from the previous block > + * from trivially overwriting the header. Such an error > + * is also not catchable by ASAN, since the header remains > + * valid memory even after the block is freed. It is still > + * theoretically possible for the header to be corrupted > + * without being caught by ASAN, but harder. > + * > + * Since the allocator needs to track order information for > + * both allocated and free blocks, and allocated blocks cannot > + * store a header, the allocator also stores per-chunk order > + * information in a reserved region at the beginning of the > + * chunk. The header includes a bitmap with the order of blocks > + * and their allocation state. It also includes the freelist > + * heads for the allocation itself. > + */ > + > +enum { > + BUDDY_POISONED = (s8)0xef, > + > + /* Number of pages to be allocated per chunk. */ > + BUDDY_CHUNK_PAGES = BUDDY_CHUNK_BYTES / __PAGE_SIZE > +}; > + > +static inline int buddy_lock(buddy_t *buddy) > +{ > + return arena_spin_lock(&buddy->lock); > +} > + > +static inline void buddy_unlock(buddy_t *buddy) > +{ > + arena_spin_unlock(&buddy->lock); > +} > + > +/* > + * Reserve part of the arena address space for the allocator. We use > + * this to get aligned addresses for the chunks, since the arena > + * page alloc kfuncs do not support aligning to a boundary (in this > + * case 1 MiB, see buddy.h on how this is derived). > + */ > +static int buddy_reserve_arena_vaddr(buddy_t *buddy) > +{ > + buddy->vaddr = 0; > + > + return bpf_arena_reserve_pages(&arena, > + (void __arena *)BUDDY_VADDR_OFFSET, > + BUDDY_VADDR_SIZE / __PAGE_SIZE); > +} > + > +/* > + * Free up any unused address space. Used only during teardown. > + */ > +static void buddy_unreserve_arena_vaddr(buddy_t *buddy) > +{ > + bpf_arena_free_pages( > + &arena, (void __arena *)(BUDDY_VADDR_OFFSET + buddy->vaddr), > + (BUDDY_VADDR_SIZE - buddy->vaddr) / __PAGE_SIZE); > + > + buddy->vaddr = 0; > +} > + > +/* > + * Carve out part of the reserved address space and hand it over > + * to the buddy allocator. > + * > + * We are assuming the buddy allocator is the only allocator in the > + * system, so there is no race between this function reserving a > + * page range and some other allocator actually making the BPF call > + * to really create and reserve it. > + * > + * However, bump allocation must still be atomic because this function > + * is called without the buddy lock from multiple threads concurrently. 
> + */ > +__weak int buddy_alloc_arena_vaddr(buddy_t __arg_arena *buddy, u64 *vaddrp) > +{ > + u64 vaddr, old, new; > + > + if (!buddy || !vaddrp) > + return -EINVAL; > + > + do { > + vaddr = buddy->vaddr; > + new = vaddr + BUDDY_CHUNK_BYTES; > + > + if (new > BUDDY_VADDR_SIZE) > + return -EINVAL; > + > + old = __sync_val_compare_and_swap(&buddy->vaddr, vaddr, new); > + } while (old != vaddr && can_loop); > + > + if (old != vaddr) > + return -EINVAL; > + > + *vaddrp = BUDDY_VADDR_OFFSET + vaddr; > + > + return 0; > +} > + > +static u64 arena_next_pow2(__u64 n) > +{ > + n--; > + n |= n >> 1; > + n |= n >> 2; > + n |= n >> 4; > + n |= n >> 8; > + n |= n >> 16; > + n |= n >> 32; > + n++; > + > + return n; > +} > + > +__weak > +int idx_set_allocated(buddy_chunk_t __arg_arena *chunk, u64 idx, bool allocated) > +{ > + bool already_allocated; > + > + if (unlikely(idx >= BUDDY_CHUNK_ITEMS)) { > + arena_stderr("setting state of invalid idx (%ld, max %d)\n", idx, > + BUDDY_CHUNK_ITEMS); > + return -EINVAL; > + } > + > + already_allocated = chunk->allocated[idx / 8] & (1 << (idx % 8)); > + if (unlikely(already_allocated == allocated)) { > + arena_stderr("Double %s of idx %ld for chunk %p", > + allocated ? "alloc" : "free", > + idx, chunk); > + return -EINVAL; > + } > + > + if (allocated) > + chunk->allocated[idx / 8] |= 1 << (idx % 8); > + else > + chunk->allocated[idx / 8] &= ~(1 << (idx % 8)); > + > + return 0; > +} > + > +static int idx_is_allocated(buddy_chunk_t *chunk, u64 idx, bool *allocated) > +{ > + if (unlikely(idx >= BUDDY_CHUNK_ITEMS)) { > + arena_stderr("getting state of invalid idx (%llu, max %d)\n", idx, > + BUDDY_CHUNK_ITEMS); > + return -EINVAL; > + } > + > + *allocated = chunk->allocated[idx / 8] & (1 << (idx % 8)); > + return 0; > +} > + > +__weak > +int idx_set_order(buddy_chunk_t __arg_arena *chunk, u64 idx, u8 order) > +{ > + u8 prev_order; > + > + if (unlikely(order >= BUDDY_CHUNK_NUM_ORDERS)) { > + arena_stderr("setting invalid order %u\n", order); > + return -EINVAL; > + } > + > + if (unlikely(idx >= BUDDY_CHUNK_ITEMS)) { > + arena_stderr("setting order of invalid idx (%d, max %d)\n", idx, > + BUDDY_CHUNK_ITEMS); > + return -EINVAL; > + } > + > + /* > + * We store two order instances per byte, one per nibble. > + * Retain the existing nibble. > + */ > + prev_order = chunk->orders[idx / 2]; > + if (idx & 0x1) { > + order &= 0xf; > + order |= (prev_order & 0xf0); > + } else { > + order <<= 4; > + order |= (prev_order & 0xf); > + } > + > + chunk->orders[idx / 2] = order; > + > + return 0; > +} > + > +static u8 idx_get_order(buddy_chunk_t *chunk, u64 idx) > +{ > + u8 result; > + > + _Static_assert(BUDDY_CHUNK_NUM_ORDERS <= 16, > + "order must fit in 4 bits"); > + > + if (unlikely(idx >= BUDDY_CHUNK_ITEMS)) { > + arena_stderr("getting order of invalid idx %u\n", idx); > + return BUDDY_CHUNK_NUM_ORDERS; > + } > + > + result = chunk->orders[idx / 2]; > + > + return (idx & 0x1) ? (result & 0xf) : (result >> 4); > +} > + > +static void __arena *idx_to_addr(buddy_chunk_t *chunk, size_t idx) > +{ > + u64 address; > + > + if (unlikely(idx >= BUDDY_CHUNK_ITEMS)) { > + arena_stderr("translating invalid idx %u\n", idx); > + return NULL; > + } > + > + /* > + * The data blocks start in the chunk after the metadata block. > + * We find the actual address by indexing into the region at a > + * BUDDY_MIN_ALLOC_BYTES granularity, the minimum allowed.
> + * The index number already accounts for the fact that the first > + * blocks in the chunk are occupied by the metadata, so we do > + * not need to offset it. > + */ > + > + address = (u64)chunk + (idx * BUDDY_MIN_ALLOC_BYTES); > + > + return (void __arena *)address; > +} > + > +static buddy_header_t *idx_to_header(buddy_chunk_t *chunk, size_t idx) > +{ > + bool allocated; > + u64 address; > + > + if (unlikely(idx_is_allocated(chunk, idx, &allocated))) { > + arena_stderr("accessing invalid idx 0x%lx\n", idx); > + return NULL; > + } > + > + if (unlikely(allocated)) { > + arena_stderr("accessing allocated idx 0x%lx as header\n", idx); > + return NULL; > + } > + > + address = (u64)idx_to_addr(chunk, idx); > + if (!address) > + return NULL; > + > + /* > + * Offset the header within the block. This avoids accidental overwrites > + * to the header because of off-by-one errors when using adjacent blocks. > + * > + * The offset has been chosen as a compromise between ASAN effectiveness > + * and allocator granularity: > + * 1) ASAN dictates valid data runs are 8-byte aligned. > + * 2) We want to keep a low minimum allocation size (currently 16). > + * > + * As a result, we have only two possible positions for the header: Bytes > + * 0 and 8. Keeping the header in byte 0 means off-by-ones from the previous > + * block touch the header, and, since the header must be accessible, ASAN > + * will not trigger. Keeping the header on byte 8 means off-by-one errors from > + * the previous block are caught by ASAN. Negative offsets are rarer, so > + * while accesses into the block from the next block are possible, they are > + * less probable. > + */ > + > + return (buddy_header_t *)(address + BUDDY_HEADER_OFF); > +} > + > +static void header_add_freelist(buddy_chunk_t *chunk, buddy_header_t *header, > + u64 idx, u8 order) > +{ > + buddy_header_t *tmp_header; > + > + idx_set_order(chunk, idx, order); > + > + header->next_index = chunk->freelists[order]; > + header->prev_index = BUDDY_CHUNK_ITEMS; > + > + if (header->next_index != BUDDY_CHUNK_ITEMS) { > + tmp_header = idx_to_header(chunk, header->next_index); > + tmp_header->prev_index = idx; > + } > + > + chunk->freelists[order] = idx; > +} > + > +static void header_remove_freelist(buddy_chunk_t *chunk, > + buddy_header_t *header, u8 order) > +{ > + buddy_header_t *tmp_header; > + > + if (header->prev_index != BUDDY_CHUNK_ITEMS) { > + tmp_header = idx_to_header(chunk, header->prev_index); > + tmp_header->next_index = header->next_index; > + } > + > + if (header->next_index != BUDDY_CHUNK_ITEMS) { > + tmp_header = idx_to_header(chunk, header->next_index); > + tmp_header->prev_index = header->prev_index; > + } > + > + /* Pop off the list head if necessary. */ > + if (idx_to_header(chunk, chunk->freelists[order]) == header) > + chunk->freelists[order] = header->next_index; > + > + header->prev_index = BUDDY_CHUNK_ITEMS; > + header->next_index = BUDDY_CHUNK_ITEMS; > +} > + > +static u64 size_to_order(size_t size) > +{ > + u64 order; > + > + /* > + * Legal sizes are [1, 4GiB] (the biggest possible arena). > + * Of course, sizes close to GiB are practically impossible > + * to fulfill and allocation will fail, but that's taken care > + * of by the caller. 
> + */ > + > + if (unlikely(size == 0 || size > (1UL << 32))) { > + arena_stderr("illegal size request %lu\n", size); > + return 64; > + } > + /* > + * To find the order of the allocation we find the first power of two > + * >= the requested size, take the log2, then adjust it for the minimum > + * allocation size by removing the minimum shift from it. Requests > + * smaller than the minimum allocation size are rounded up. > + */ > + order = arena_fls(arena_next_pow2(size)) - 1; > + if (order < BUDDY_MIN_ALLOC_SHIFT) > + return 0; > + > + return order - BUDDY_MIN_ALLOC_SHIFT; > +} > + > +__weak > +int add_leftovers_to_freelist(buddy_chunk_t __arg_arena *chunk, u32 cur_idx, > + u64 min_order, u64 max_order) > +{ > + buddy_header_t *header; > + u64 ord; > + u32 idx; > + > + for (ord = min_order; ord < max_order && can_loop; ord++) { > + /* Mark the buddy as free and add it to the freelists. */ > + idx = cur_idx + (1 << ord); > + > + header = idx_to_header(chunk, idx); > + if (unlikely(!header)) { > + arena_stderr("idx %u has no header", idx); > + return -EINVAL; > + } > + > + asan_unpoison(header, sizeof(*header)); > + > + header_add_freelist(chunk, header, idx, ord); > + } > + > + return 0; > +} > + > +static buddy_chunk_t *buddy_chunk_get(buddy_t *buddy) > +{ > + u64 order, ord, min_order, max_order; > + buddy_chunk_t *chunk; > + size_t left; > + int power2; > + u64 vaddr; > + u32 idx; > + int ret; > + > + /* > + * Step 1: Allocate a properly aligned chunk, and > + * prep it for insertion into the buddy allocator. > + * We don't need the allocator lock until step 2. > + */ > + > + ret = buddy_alloc_arena_vaddr(buddy, &vaddr); > + if (ret) > + return NULL; > + > + /* Addresses must be aligned to the chunk boundary. */ > + if (vaddr % BUDDY_CHUNK_BYTES) > + return NULL; > + > + /* Unreserve the address space. */ > + bpf_arena_free_pages(&arena, (void __arena *)vaddr, > + BUDDY_CHUNK_PAGES); > + > + chunk = bpf_arena_alloc_pages(&arena, (void __arena *)vaddr, > + BUDDY_CHUNK_PAGES, NUMA_NO_NODE, 0); > + if (!chunk) { > + arena_stderr("[ALLOC FAILED]"); > + return NULL; > + } > + > + if (buddy_lock(buddy)) { > + /* > + * We cannot reclaim the vaddr space, but that is ok - this > + * operation should always succeed. The error path is to catch > + * accidental deadlocks that will cause -ENOMEMs to the program as > + * the allocator fails to refill itself, in which case vaddr usage > + * is the least of our worries. > + */ > + bpf_arena_free_pages(&arena, (void __arena *)vaddr, BUDDY_CHUNK_PAGES); > + return NULL; > + } > + > + asan_poison(chunk, BUDDY_POISONED, BUDDY_CHUNK_PAGES * __PAGE_SIZE); > + > + /* Unpoison the chunk itself. */ > + asan_unpoison(chunk, sizeof(*chunk)); > + > + /* Mark all freelists as empty. */ > + for (ord = zero; ord < BUDDY_CHUNK_NUM_ORDERS && can_loop; ord++) > + chunk->freelists[ord] = BUDDY_CHUNK_ITEMS; > + > + /* > + * Initialize the chunk by carving out a page range to hold the metadata > + * struct above, then dumping the rest of the pages into the allocator. > + */ > + > + _Static_assert(BUDDY_CHUNK_PAGES * __PAGE_SIZE >= > + BUDDY_MIN_ALLOC_BYTES * > + BUDDY_CHUNK_ITEMS, > + "chunk must fit within the allocation"); > + > + /* > + * Step 2: Reserve a chunk for the chunk metadata, then break > + * the rest of the full allocation into the different buckets. > + * We allocate the memory by grabbing blocks of progressively > + * smaller sizes from the allocator, which are guaranteed to be > + * contiguous. > + * > + * This operation also populates the allocator.
> + * > + * Algorithm: > + * > + * - max_order: The last order allocation we made > + * - left: How many bytes are left to allocate > + * - cur_index: Current index into the top-level block we are > + * allocating from. > + * > + * Step 3: > + * - Find the largest power-of-2 allocation still smaller than left (infimum) > + * - Reserve a chunk of that size, along with its buddy > + * - For every order from [infimum + 1, last order), carve out a block > + * and put it into the allocator. > + * > + * Example: Chunk size 0b1010000 (80 bytes) > + * > + * Step 1: > + * > + * idx infimum 1 << max_order > + * 0 64 128 1 << 20 > + * |________|_________|______________________| > + * > + * Blocks set aside: > + * [0, 64) - Completely allocated > + * [64, 128) - Will be further split in the next iteration > + * > + * Blocks added to the allocator: > + * [128, 256) > + * [256, 512) > + * ... > + * [1 << 18, 1 << 19) > + * [1 << 19, 1 << 20) > + * > + * Step 2: > + * > + * idx infimum idx + 1 << max_order > + * 64 80 96 64 + 1 << 6 = 128 > + * |________|_________|______________________| > + * > + * Blocks set aside: > + * [64, 80) - Completely allocated > + * > + * Blocks added to the allocator: > + * [80, 96) - left == 0 so the buddy is unused and marked as freed > + * [96, 128) > + */ > + max_order = BUDDY_CHUNK_NUM_ORDERS; > + left = sizeof(*chunk); > + idx = 0; > + while (left && can_loop) { > + power2 = arena_fls(left) - 1; > + /* > + * Note: The condition below only triggers to catch serious bugs > + * early. There is no sane way to undo any block insertions from > + * the allocated chunk, so just leak any leftover allocations, > + * emit a diagnostic, unlock and exit. > + * > + */ > + if (unlikely(power2 >= BUDDY_CHUNK_NUM_ORDERS)) { > + arena_stderr( > + "buddy chunk metadata require allocation of order %d\n", > + power2); > + arena_stderr( > + "chunk has size of 0x%lx bytes (left %lx bytes)\n", > + sizeof(*chunk), left); > + buddy_unlock(buddy); > + > + return NULL; > + } > + > + /* Round up allocations that are too small. */ > + > + left -= (power2 >= BUDDY_MIN_ALLOC_SHIFT) ? 1 << power2 : left; > + order = (power2 >= BUDDY_MIN_ALLOC_SHIFT) ? power2 - BUDDY_MIN_ALLOC_SHIFT : 0; > + > + if (idx_set_allocated(chunk, idx, true)) { > + buddy_unlock(buddy); > + return NULL; > + } > + > + /* > + * Starting an order above the one we allocated, populate > + * the allocator with free blocks. If this is the last > + * allocation (left == 0), also mark the buddy as free. > + * > + * See comment above about error handling: The error path > + * is only there as a way to mitigate deeply buggy allocator > + * states by emitting a diagnostic in add_leftovers_to_freelist() > + * and leaking any memory not added in the freelists. > + */ > + min_order = left ? order + 1 : order; > + if (add_leftovers_to_freelist(chunk, idx, min_order, max_order)) { > + buddy_unlock(buddy); > + return NULL; > + } > + > + /* Adjust the index. */ > + idx += 1 << order; > + max_order = order; > + } > + > + buddy_unlock(buddy); > + > + return chunk; > +} > + > +__weak int buddy_init(buddy_t __arg_arena *buddy) > +{ > + buddy_chunk_t *chunk; > + int ret; > + > + if (!asan_ready()) > + return -EINVAL; > + > + /* Reserve enough address space to ensure allocations are aligned. 
*/ > + ret = buddy_reserve_arena_vaddr(buddy); > + if (ret) > + return ret; > + > + _Static_assert(BUDDY_CHUNK_PAGES > 0, > + "chunk must use one or more pages"); > + > + chunk = buddy_chunk_get(buddy); > + > + if (buddy_lock(buddy)) { > + bpf_arena_free_pages(&arena, chunk, BUDDY_CHUNK_PAGES); > + return -EINVAL; > + } > + > + /* Chunk is already properly unpoisoned if allocated. */ > + if (chunk) > + chunk->next = buddy->first_chunk; > + > + /* Put the chunk at the beginning of the list. */ > + buddy->first_chunk = chunk; > + > + buddy_unlock(buddy); > + > + return chunk ? 0 : -ENOMEM; > +} > + > +/* > + * Destroy the allocator. This does not check whether there are any allocations > + * currently in use, so any pages being accessed will start taking arena faults. > + * We do not take a lock because we are freeing arena pages, and nobody should > + * be using the allocator at that point in the execution. > + */ > +__weak int buddy_destroy(buddy_t __arg_arena *buddy) > +{ > + buddy_chunk_t *chunk, *next; > + > + if (!buddy) > + return -EINVAL; > + > + /* > + * Traverse all buddy chunks and free them back to the arena > + * with the same granularity they were allocated with. > + */ > + for (chunk = buddy->first_chunk; chunk && can_loop; chunk = next) { > + next = chunk->next; > + > + /* Wholesale poison the entire block. */ > + asan_poison(chunk, BUDDY_POISONED, > + BUDDY_CHUNK_PAGES * __PAGE_SIZE); > + bpf_arena_free_pages(&arena, chunk, BUDDY_CHUNK_PAGES); > + } > + > + /* Free up any part of the address space that did not get used. */ > + buddy_unreserve_arena_vaddr(buddy); > + > + /* Clear all fields. */ > + buddy->first_chunk = NULL; > + > + return 0; > +} > + > +__weak u64 buddy_chunk_alloc(buddy_chunk_t __arg_arena *chunk, int order_req) > +{ > + buddy_header_t *header, *tmp_header, *next_header; > + u32 idx, tmpidx, retidx; > + u64 address; > + u64 order = 0; > + u64 i; > + > + for (order = order_req; order < BUDDY_CHUNK_NUM_ORDERS && can_loop; order++) { > + if (chunk->freelists[order] != BUDDY_CHUNK_ITEMS) > + break; > + } > + > + if (order >= BUDDY_CHUNK_NUM_ORDERS) > + return (u64)NULL; > + > + retidx = chunk->freelists[order]; > + header = idx_to_header(chunk, retidx); > + if (unlikely(!header)) > + return (u64) NULL; > + > + chunk->freelists[order] = header->next_index; > + > + if (header->next_index != BUDDY_CHUNK_ITEMS) { > + next_header = idx_to_header(chunk, header->next_index); > + next_header->prev_index = BUDDY_CHUNK_ITEMS; > + } > + > + header->prev_index = BUDDY_CHUNK_ITEMS; > + header->next_index = BUDDY_CHUNK_ITEMS; > + if (idx_set_order(chunk, retidx, order_req)) > + return (u64)NULL; > + > + if (idx_set_allocated(chunk, retidx, true)) > + return (u64)NULL; > + > + /* > + * Do not unpoison the address yet, will be done by the caller > + * because the caller has the exact allocation size requested. > + */ > + address = (u64)idx_to_addr(chunk, retidx); > + if (!address) > + return (u64)NULL; > + > + /* If we allocated from a larger-order chunk, split the buddies. */ > + for (i = order_req; i < order && can_loop; i++) { > + /* > + * Flip the bit for the current order (the bit is guaranteed > + * to be 0, so just add 1 << i). > + */ > + idx = retidx + (1 << i); > + > + /* Add the buddy of the allocation to the free list. 
*/ > + header = idx_to_header(chunk, idx); > + /* Unpoison the buddy header */ > + asan_unpoison(header, sizeof(*header)); > + > + if (idx_set_order(chunk, idx, i)) > + return (u64)NULL; > + > + /* Push the header to the beginning of the freelists list. */ > + tmpidx = chunk->freelists[i]; > + > + header->prev_index = BUDDY_CHUNK_ITEMS; > + header->next_index = tmpidx; > + > + if (tmpidx != BUDDY_CHUNK_ITEMS) { > + tmp_header = idx_to_header(chunk, tmpidx); > + tmp_header->prev_index = idx; > + } > + > + chunk->freelists[i] = idx; > + } > + > + return address; > +} > + > +/* Scan the existing chunks for available memory. */ > +static u64 buddy_alloc_from_existing_chunks(buddy_t *buddy, int order) > +{ > + buddy_chunk_t *chunk; > + u64 address; > + > + for (chunk = buddy->first_chunk; chunk != NULL && can_loop; > + chunk = chunk->next) { > + address = buddy_chunk_alloc(chunk, order); > + if (address) > + return address; > + } > + > + return (u64)NULL; > +} > + > +/* > + * Try an allocation from a newly allocated chunk. Also > + * incorporate the chunk into the linked list. > + */ > +static u64 buddy_alloc_from_new_chunk(buddy_t *buddy, buddy_chunk_t *chunk, int order) > +{ > + u64 address; > + > + if (buddy_lock(buddy)) > + return (u64)NULL; > + > + > + /* > + * Add the chunk into the allocator and try > + * to allocate specifically from that chunk. > + */ > + chunk->next = buddy->first_chunk; > + buddy->first_chunk = chunk; > + > + address = buddy_chunk_alloc(buddy->first_chunk, order); > + > + buddy_unlock(buddy); > + > + return (u64)address; > +} > +__weak > +u64 buddy_alloc_internal(buddy_t __arg_arena *buddy, size_t size) > +{ > + buddy_chunk_t *chunk; > + u64 address = (u64)NULL; > + int order; > + > + if (!buddy) > + return (u64)NULL; > + > + order = size_to_order(size); > + if (order >= BUDDY_CHUNK_NUM_ORDERS || order < 0) { > + arena_stderr("invalid order %d (sz %lu)\n", order, size); > + return (u64)NULL; > + } > + > + if (buddy_lock(buddy)) > + return (u64)NULL; > + > + address = buddy_alloc_from_existing_chunks(buddy, order); > + buddy_unlock(buddy); > + if (address) > + goto done; > + > + /* Get a new chunk. */ > + chunk = buddy_chunk_get(buddy); > + if (chunk) > + address = buddy_alloc_from_new_chunk(buddy, chunk, order); > + > +done: > + /* If we failed to allocate memory, return NULL. */ > + if (!address) > + return (u64)NULL; > + > + /* > + * Unpoison exactly the amount of bytes requested. If the > + * data is smaller than the header, we must poison any > + * unused bytes that were part of the header. > + */ > + if (size < BUDDY_HEADER_OFF + sizeof(buddy_header_t)) > + asan_poison((u8 __arena *)address + BUDDY_HEADER_OFF, > + BUDDY_POISONED, sizeof(buddy_header_t)); > + > + asan_unpoison((u8 __arena *)address, size); > + > + return address; > +} > + > +static __always_inline int buddy_free_unlocked(buddy_t *buddy, u64 addr) > +{ > + buddy_header_t *header, *buddy_header; > + u64 idx, buddy_idx, tmp_idx; > + buddy_chunk_t *chunk; > + bool allocated; > + u8 order; > + int ret; > + > + if (!buddy) > + return -EINVAL; > + > + if (addr & (BUDDY_MIN_ALLOC_BYTES - 1)) { > + arena_stderr("Freeing unaligned address %llx\n", addr); > + return -EINVAL; > + } > + > + /* Get (chunk, idx) out of the address. */ > + chunk = (void __arena *)(addr & ~BUDDY_CHUNK_OFFSET_MASK); > + idx = (addr & BUDDY_CHUNK_OFFSET_MASK) / BUDDY_MIN_ALLOC_BYTES; > + > + /* Mark the block as unallocated so we can access the header. 
*/ > + ret = idx_set_allocated(chunk, idx, false); > + if (ret) > + return ret; > + > + order = idx_get_order(chunk, idx); > + header = idx_to_header(chunk, idx); > + > + /* The header is in the block itself, keep it unpoisoned. */ > + asan_poison((u8 __arena *)addr, BUDDY_POISONED, > + BUDDY_MIN_ALLOC_BYTES << order); > + asan_unpoison(header, sizeof(*header)); > + > + /* > + * Coalescing loop. Merge with free buddies of equal order. > + * For every coalescing step, keep the left buddy and > + * drop the right buddy's header. > + */ > + for (; order < BUDDY_CHUNK_NUM_ORDERS && can_loop; order++) { > + buddy_idx = idx ^ (1 << order); > + > + /* Check if the buddy is actually free. */ > + idx_is_allocated(chunk, buddy_idx, &allocated); > + if (allocated) > + break; > + > + /* > + * If buddy is not the same order as the chunk > + * being freed, then we're done coalescing. > + */ > + if (idx_get_order(chunk, buddy_idx) != order) > + break; > + > + buddy_header = idx_to_header(chunk, buddy_idx); > + header_remove_freelist(chunk, buddy_header, order); > + > + /* Keep the left header out of the two buddies, drop the other one. */ > + if (buddy_idx < idx) { > + tmp_idx = idx; > + idx = buddy_idx; > + buddy_idx = tmp_idx; > + } > + > + /* Remove the buddy from the freelists so that we can merge it. */ > + idx_set_order(chunk, buddy_idx, order); > + > + buddy_header = idx_to_header(chunk, buddy_idx); > + asan_poison(buddy_header, BUDDY_POISONED, > + sizeof(*buddy_header)); > + } > + > + /* Header properly freed but not in any freelists yet. */ > + idx_set_order(chunk, idx, order); > + > + header = idx_to_header(chunk, idx); > + header_add_freelist(chunk, header, idx, order); > + > + return 0; > +} > + > +__weak int buddy_free_internal(buddy_t __arg_arena *buddy, u64 addr) > +{ > + int ret; > + > + if (!buddy) > + return -EINVAL; > + > + /* Freeing NULL is a valid no-op. */ > + if (!addr) > + return 0; > + > + ret = buddy_lock(buddy); > + if (ret) > + return ret; > + > + ret = buddy_free_unlocked(buddy, addr); > + > + buddy_unlock(buddy); > + > + return ret; > +} > + > +__weak char _license[] SEC("license") = "GPL"; > diff --git a/tools/testing/selftests/bpf/libarena/src/common.bpf.c b/tools/testing/selftests/bpf/libarena/src/common.bpf.c > index 1c185b75d799..e142c9594b5c 100644 > --- a/tools/testing/selftests/bpf/libarena/src/common.bpf.c > +++ b/tools/testing/selftests/bpf/libarena/src/common.bpf.c > @@ -1,11 +1,15 @@ > // SPDX-License-Identifier: LGPL-2.1 OR BSD-2-Clause > /* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */ > #include > +#include > +#include > > #include > > const volatile u32 zero = 0; > > +buddy_t buddy; > + > /* How many pages do we reserve at the beginning of the arena segment? */ > #define RESERVE_ALLOC (8) > > @@ -61,4 +65,23 @@ __weak int arena_alloc_reserve(void) > return bpf_arena_reserve_pages(&arena, NULL, RESERVE_ALLOC); > } > + > +SEC("syscall") > +__weak int arena_buddy_reset(void) > +{ > + buddy_destroy(&buddy); > + > + return buddy_init(&buddy); > +} I feel as though introducing a trivial helper buddy_reset() (wrapping this exact dance) within the buddy API would be appropriate here; see the sketch at the bottom of this mail for roughly what I mean. > +__weak u64 malloc_internal(size_t size) > +{ > + return buddy_alloc_internal(&buddy, size); > +} > + > +__weak void free(void __arg_arena __arena *ptr) > +{ > + buddy_free_internal(&buddy, (u64)ptr); > +} > + > + > char _license[] SEC("license") = "GPL"; > -- > 2.53.0 >
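To spell out the buddy_reset() suggestion from further up: all I have
in mind is something along the lines of the below (untested sketch,
name and placement merely a suggestion), wrapping the exact
destroy/init dance arena_buddy_reset() currently open-codes:

	/* Tear the allocator down and bring it back up in one call. */
	__weak int buddy_reset(buddy_t __arg_arena *buddy)
	{
		int ret;

		/* Free all chunks and release the reserved vaddr space. */
		ret = buddy_destroy(buddy);
		if (ret)
			return ret;

		/* Reserve fresh address space and pull in the first chunk. */
		return buddy_init(buddy);
	}

As a side benefit, it gives callers a natural way to propagate the
return value of buddy_destroy(), which arena_buddy_reset() currently
drops on the floor.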