* [PATCH RFC v3 0/2] Task local data API
@ 2025-04-25 21:40 Amery Hung
2025-04-25 21:40 ` [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data Amery Hung
0 siblings, 3 replies; 18+ messages in thread
From: Amery Hung @ 2025-04-25 21:40 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
ameryhung, kernel-team
Hi,
This is a respin of the uptr KV store. It has been renamed to task local
data (TLD) as the problem statement and the solution have changed, and it
now draws more similarities to pthread thread-specific data.
* Overview *
This patchset is a continuation of the original UPTR work[0], which aims
to provide a fast way for user space programs to pass per-task hints to
sched_ext schedulers. UPTR built the foundation by supporting sharing
user pages with bpf programs through task local storage maps.
Additionally, sched_ext would like to allow multiple developers to share
a storage without explicitly agreeing on its layout. This simplifies
code base management and makes experimenting easier.
While a centralized storage layout definition would have worked, the
friction of synchronizing it across different repos is not desirable.
This patchset contains the user space plumbing so that user space and bpf
program developers can exchange per-task hints easily through simple
interfaces.
* Design *
BPF task local data is a simple API for sharing task-specific data
between user space and bpf programs, where data are referred to using
string keys. As shown in the following figure, user space programs can
define a task local data using bpf_tld_type_var(). The data is
effectively a variable declared with __thread, of which every thread owns
an independent copy that can be directly accessed. On the bpf side, a
task local data first needs to be initialized once for every new task
(e.g., in sched_ext_ops::init_task) using bpf_tld_init_var(). Then, other
bpf programs can get a pointer to the data using bpf_tld_lookup(). The
task local data APIs refer to data using string keys so developers
do not need to deal with addresses of data in a shared storage.
┌─ Application ─────────────────────────────────────────┐
│ ┌─ library A ──────────────┐ │
│ bpf_tld_type_var(int, X) │ bpf_tld_type_var(int, Y) │ │
│ └┬─────────────────────────┘ │
└───────┬───────────────────│───────────────────────────┘
│ X = 123; │ Y = true;
V V
+ ─ Task local data ─ ─ ─ ─ ─ ─ +
| ┌─ task_kvs_map ────────────┐ | ┌─ sched_ext_ops::init_task ──────┐
| │ BPF Task local storage │ | │ bpf_tld_init_var(&kvs, X); │
| │ ┌───────────────────┐ │ |<─┤ bpf_tld_init_var(&kvs, Y); │
| │ │ __uptr *udata │ │ | └─────────────────────────────────┘
| │ └───────────────────┘ │ |
| │ ┌───────────────────┐ │ | ┌─ Other sched_ext_ops op ────────┐
| │ │ __uptr *umetadata │ │ | │ int *y; ├┐
| │ └───────────────────┘ │ |<─┤ y = bpf_tld_lookup(&kvs, Y, 1); ││
| └───────────────────────────┘ | │ if (y) ││
| ┌─ task_kvs_off_map ────────┐ | │ /* do something */ ││
| │ BPF Task local storage │ | └┬────────────────────────────────┘│
| └───────────────────────────┘ | └─────────────────────────────────┘
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ +
* Implementation *
Task local data API hides the memory management from the developers.
Internally, it shares user data with bpf programs through udata UPTRs.
Task local data from different compilation units are placed into a
custom "udata" section by the declaration API, bpf_tld_type_var(), so
that they are placed together in memory. User space will need to
call bpf_tld_thread_init() for every new thread to pin udata pages to
the kernel.
The metadata used to address udata is stored in umetadata UPTR. It is
generated by constructors inserted by bpf_tld_type_var() and
bpf_tld_thread_init(). umetadata is an array of 64 metadata entries, one
per data item, each containing the key and the offset of the data in udata.
During initialization, bpf_tld_init_var() will search umetadata for
a matching key and cache its offset in task_kvs_off_map. Later,
bpf_tld_lookup() will use the cached offset to retrieve a pointer into
udata.
* Limitation *
Currently, it is assumed that all key-value pairs are known when a program
starts. All compilation units using task local data should be statically
linked together so that values are all placed together in a udata section
and therefore can be shared with bpf through two UPTRs. The next
iteration will explore how bpf task local data can work with dynamic
libraries. Maybe more udata UPTRs will be added to pin the TLS pages
of dynamically loaded modules. Or maybe it will allocate memory for data
instead of relying on __thread, and slightly change how user space
interacts with task local data. The latter approach can also avoid some
trouble dealing with the restrictions of UPTR.
Some other limitations:
- Total task local data cannot exceed a page
- Only 64 task local data items are supported
- Some memory is wasted for data whose size is not a power of two,
  due to a UPTR limitation
[0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
Amery Hung (2):
selftests/bpf: Introduce task local data
selftests/bpf: Test basic workflow of task local data
.../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
.../bpf/prog_tests/task_local_data.h | 58 ++++++
.../bpf/prog_tests/test_task_local_data.c | 156 +++++++++++++++
.../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
.../bpf/progs/test_task_local_data_basic.c | 78 ++++++++
.../selftests/bpf/task_local_data_common.h | 49 +++++
6 files changed, 681 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
--
2.47.1
* [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data
2025-04-25 21:40 [PATCH RFC v3 0/2] Task local data API Amery Hung
@ 2025-04-25 21:40 ` Amery Hung
2025-04-30 1:44 ` Alexei Starovoitov
2025-05-01 20:38 ` Andrii Nakryiko
2025-04-25 21:40 ` [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of " Amery Hung
2025-05-01 20:37 ` [PATCH RFC v3 0/2] Task local data API Andrii Nakryiko
2 siblings, 2 replies; 18+ messages in thread
From: Amery Hung @ 2025-04-25 21:40 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
ameryhung, kernel-team
Task local data provides simple and fast bpf and user space APIs to
exchange per-task data through task local storage map. The user space
first declares task local data using bpf_tld_key_type_var() or
bpf_tld_type_var(). The data is a thread-specific variable of which
every thread has its own copy. Then, bpf_tld_thread_init() needs to
be called for every thread to share the data with bpf. Finally, users
can directly read/write thread local data without bpf syscalls.
For bpf programs to see task local data, each data item needs to be
initialized once for every new task using bpf_tld_init_var(). Then
bpf programs can retrieve pointers to the data with bpf_tld_lookup().
The core of task local storage implementation relies on UPTRs. They
pin user pages to the kernel so that user space can share data with bpf
programs without syscalls. Both data and the metadata used to access
data are pinned via UPTRs.
A limitation of UPTR makes the implementation of task local data
less trivial than it sounds: memory pinned to UPTR cannot exceed a
page and must not cross the page boundary. In addition, the data
declaration uses the __thread specifier and therefore does not have
direct control over the memory allocation. Hence, several tricks
and checks are used to make it work.
First, task local data declaration APIs place data in a custom "udata"
section so that data from different compilation units will be contiguous
in the memory and can be pinned using two UPTRs if they are smaller than
one page.
To prevent any data item from spanning two pages, each is aligned to
the smallest power of two greater than or equal to its size.
As we don't control the memory allocation for data, we need to figure
out the layout of user defined data. This is done by the data
declaration API and bpf_tld_thread_init(). The data declaration API
will insert constructors for all data, and they are used to find the
size and offset of data as well as the beginning and end of the whole
udata section. Then, bpf_tld_thread_init() performs a per-thread check
to make sure no data will cross the page boundary, as udata can start
at a different offset within a page in each thread.
Note that for umetadata, we directly aligned_alloc() memory for it and
assign it to the UPTR. This is done only once per process as all
tasks share the same umetadata. The actual thread-specific data offset
will be adjusted in the bpf program when calling bpf_tld_init_var().
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
.../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
.../bpf/prog_tests/task_local_data.h | 58 ++++++
.../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
.../selftests/bpf/task_local_data_common.h | 41 ++++
4 files changed, 439 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.c b/tools/testing/selftests/bpf/prog_tests/task_local_data.c
new file mode 100644
index 000000000000..5a21514573d2
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.c
@@ -0,0 +1,159 @@
+#include <fcntl.h>
+#include <errno.h>
+#include <stdio.h>
+#include <pthread.h>
+
+#include <bpf/bpf.h>
+
+#include "bpf_util.h"
+#include "task_local_data.h"
+#include "task_local_storage_helpers.h"
+
+#define PIDFD_THREAD O_EXCL
+
+/* To find the start of udata for each thread, insert a dummy variable to udata.
+ * Constructors generated for every task local data will figure out the offset
+ * from the beginning of udata to the dummy symbol. Then, every thread can infer
+ * the start of udata by subtracting the offset from the address of dummy.
+ */
+static __thread struct udata_dummy {} udata_dummy SEC("udata");
+
+static __thread bool task_local_data_thread_inited;
+
+struct task_local_data {
+ void *udata_start;
+ void *udata_end;
+ int udata_start_dummy_off;
+ struct meta_page *umetadata;
+ int umetadata_cnt;
+ bool umetadata_init;
+ short udata_sizes[64];
+ pthread_mutex_t lock;
+} task_local_data = {
+ .udata_start = (void *)-1UL,
+ .lock = PTHREAD_MUTEX_INITIALIZER,
+};
+
+static void tld_set_data_key_meta(int i, const char *key, short off)
+{
+ task_local_data.umetadata->meta[i].off = off;
+ strncpy(task_local_data.umetadata->meta[i].key, key, TASK_LOCAL_DATA_KEY_LEN);
+}
+
+static struct key_meta *tld_get_data_key_meta(int i)
+{
+ return &task_local_data.umetadata->meta[i];
+}
+
+static void tld_set_data_size(int i, int size)
+{
+ task_local_data.udata_sizes[i] = size;
+}
+
+static int tld_get_data_size(int i)
+{
+ return task_local_data.udata_sizes[i];
+}
+
+void __bpf_tld_var_init(const char *key, void *var, int size)
+{
+ int i;
+
+ i = task_local_data.umetadata_cnt++;
+
+ if (!task_local_data.umetadata) {
+ if (task_local_data.umetadata_cnt > 1)
+ return;
+
+ task_local_data.umetadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
+ if (!task_local_data.umetadata)
+ return;
+ }
+
+ if (var < task_local_data.udata_start) {
+ task_local_data.udata_start = var;
+ task_local_data.udata_start_dummy_off =
+ (void *)&udata_dummy - task_local_data.udata_start;
+ }
+
+ if (var + size > task_local_data.udata_end)
+ task_local_data.udata_end = var + size;
+
+ tld_set_data_key_meta(i, key, var - (void *)&udata_dummy);
+ tld_set_data_size(i, size);
+}
+
+int bpf_tld_thread_init(void)
+{
+ unsigned long udata_size, udata_start, udata_start_page, udata_end_page;
+ struct task_local_data_map_value map_val;
+ int i, task_id, task_fd, map_fd, err;
+
+ if (!task_local_data.umetadata_cnt || task_local_data_thread_inited)
+ return 0;
+
+ if (task_local_data.umetadata_cnt && !task_local_data.umetadata)
+ return -ENOMEM;
+
+ udata_start = (unsigned long)&udata_dummy + task_local_data.udata_start_dummy_off;
+
+ pthread_mutex_lock(&task_local_data.lock);
+ for (i = 0; i < task_local_data.umetadata_cnt; i++) {
+ struct key_meta *km = tld_get_data_key_meta(i);
+ int size = tld_get_data_size(i);
+ int off;
+
+ if (!task_local_data.umetadata_init) {
+ /* Constructors save the offset from udata_dummy to each data
+ * Now as all ctors have run and the offset between the start of
+ * udata and udata_dummy is known, adjust the offsets of data
+ * to be relative to the start of udata
+ */
+ km->off -= task_local_data.udata_start_dummy_off;
+
+ /* Data exceeding a page may not be able to be covered by
+ * two udata UPTRs in every thread
+ */
+			if (km->off >= PAGE_SIZE) {
+				pthread_mutex_unlock(&task_local_data.lock);
+				return -EOPNOTSUPP;
+			}
+ }
+
+ /* A task local data should not span across two pages. */
+		off = km->off + udata_start;
+		if ((off & PAGE_MASK) != ((off + size - 1) & PAGE_MASK)) {
+			pthread_mutex_unlock(&task_local_data.lock);
+			return -EOPNOTSUPP;
+		}
+ }
+ task_local_data.umetadata_init = true;
+ pthread_mutex_unlock(&task_local_data.lock);
+
+ udata_size = task_local_data.udata_end - task_local_data.udata_start;
+ udata_start_page = udata_start & PAGE_MASK;
+ udata_end_page = (udata_start + udata_size) & PAGE_MASK;
+
+ /* The whole udata can span across two pages for a thread. Use two UPTRs
+ * to cover the second page in case it happens.
+ */
+ map_val.udata_start = udata_start & ~PAGE_MASK;
+ map_val.udata[0].page = (struct data_page *)(udata_start_page);
+ map_val.udata[1].page = (udata_start_page == udata_end_page) ? NULL :
+ (struct data_page *)(udata_start_page + PAGE_SIZE);
+
+ /* umetadata is shared by all threads under the assumption that all
+ * task local data are defined statically and linked together
+ */
+ map_val.umetadata = task_local_data.umetadata;
+ map_val.umetadata_cnt = task_local_data.umetadata_cnt;
+
+ map_fd = bpf_obj_get(TASK_LOCAL_DATA_MAP_PIN_PATH);
+ if (map_fd < 0)
+ return -errno;
+
+ task_id = sys_gettid();
+ task_fd = sys_pidfd_open(task_id, PIDFD_THREAD);
+ err = bpf_map_update_elem(map_fd, &task_fd, &map_val, 0);
+ if (err)
+ return err;
+
+ task_local_data_thread_inited = true;
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.h b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
new file mode 100644
index 000000000000..c928e8d2c0a6
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
@@ -0,0 +1,58 @@
+#ifndef __BPF_TASK_LOCAL_DATA_H__
+#define __BPF_TASK_LOCAL_DATA_H__
+
+#include "task_local_data_common.h"
+
+#define SEC(name) __attribute__((section(name), used))
+#define __aligned(x) __attribute__((aligned(x)))
+
+#define ROUND_UP_POWER_OF_TWO(x) (1UL << (sizeof(x) * 8 - __builtin_clzl((x) - 1)))
+
+void __bpf_tld_var_init(const char *key, void *var, int size);
+
+/**
+ * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
+ * programs. The data will be a thread-specific variable which a user space
+ * program can directly read/write, while bpf programs will need to look it
+ * up with the string key.
+ *
+ * @param key The string key a task local data will be associated with. The
+ * string will be truncated if the length exceeds TASK_LOCAL_DATA_KEY_LEN
+ * @param type The type of the task local data
+ * @param var The name of the task local data
+ */
+#define bpf_tld_key_type_var(key, type, var) \
+__thread type var SEC("udata") __aligned(ROUND_UP_POWER_OF_TWO(sizeof(type))); \
+ \
+__attribute__((constructor)) \
+void __bpf_tld_##var##_init(void) \
+{ \
+ _Static_assert(sizeof(type) < PAGE_SIZE, \
+ "data size must not exceed a page"); \
+ __bpf_tld_var_init(key, &var, sizeof(type)); \
+}
+
+/**
+ * @brief bpf_tld_type_var() declares a task local data shared with bpf
+ * programs. The data will be a thread-specific variable which a user space
+ * program can directly read/write, while bpf programs will need to look up
+ * the data with a string key identical to the variable name.
+ *
+ * @param type The type of the task local data
+ * @param var The name of the task local data as well as the name of the
+ * key. The key string will be truncated if the length exceeds
+ * TASK_LOCAL_DATA_KEY_LEN.
+ */
+#define bpf_tld_type_var(type, var) \
+ bpf_tld_key_type_var(#var, type, var)
+
+/**
+ * @brief bpf_tld_thread_init() initializes the task local data for the current
+ * thread. All data are undefined from a bpf program's point of view until
+ * bpf_tld_thread_init() is called.
+ *
+ * @return 0 on success; negative error number on failure
+ */
+int bpf_tld_thread_init(void);
+
+#endif
diff --git a/tools/testing/selftests/bpf/progs/task_local_data.h b/tools/testing/selftests/bpf/progs/task_local_data.h
new file mode 100644
index 000000000000..7358993ee634
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_local_data.h
@@ -0,0 +1,181 @@
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+
+#include "task_local_data_common.h"
+
+#define PAGE_IDX_MASK 0x8000
+
+/* Overview
+ *
+ * Task local data (TLD) allows sharing per-task information between users and
+ * bpf programs without explicit storage layout management. TLD APIs use
+ * string keys to access data. Internally, TLD shares user pages through two
+ * UPTRs in a task local storage: udata and umetadata. Data are stored in udata.
+ * Keys and the offsets of the corresponding data in udata are stored in umetadata.
+ *
+ * Usage
+ *
+ * Users should initialize every task local data once for every new task before
+ * using them with bpf_tld_init_var(). It should ideally be done in non-critical
+ * paths first (e.g., sched_ext_ops::init_task) as it compares key strings and
+ * caches the offsets of data.
+ *
+ * First, users should define struct task_local_data_offsets, which will be used
+ * to cache the offsets of task local data. Each member of the struct should
+ * be a short integer with the same name as the key defined in user space.
+ * Another task local storage map will be created to save the offsets. For example:
+ *
+ * struct task_local_data_offsets {
+ * short priority;
+ * short in_critical_section;
+ * };
+ *
+ * Task local data APIs take a pointer to bpf_task_local_data object as the first
+ * argument. The object should be declared as a stack variable and initialized via
+ * bpf_tld_init(). Then, in a bpf program, to cache the offset for a key-value pair,
+ * call bpf_tld_init_var(). For example, in init_task program:
+ *
+ * struct bpf_task_local_data tld;
+ *
+ * err = bpf_tld_init(task, &tld);
+ * if (err)
+ * return 0;
+ *
+ * bpf_tld_init_var(&tld, priority);
+ * bpf_tld_init_var(&tld, in_critical_section);
+ *
+ * Subsequently and in other bpf programs, to lookup task local data, call
+ * bpf_tld_lookup(). For example:
+ *
+ * int *p;
+ *
+ * p = bpf_tld_lookup(&tld, priority, sizeof(int));
+ * if (p)
+ * // do something depending on *p
+ */
+
+struct task_local_data_offsets;
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_local_data_map_value);
+ __uint(pinning, LIBBPF_PIN_BY_NAME);
+} task_local_data_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct task_local_data_offsets);
+} task_local_data_off_map SEC(".maps");
+
+struct bpf_task_local_data {
+ struct task_local_data_map_value *data_map;
+ struct task_local_data_offsets *off_map;
+};
+
+/**
+ * @brief bpf_tld_init() initializes a bpf_task_local_data object.
+ *
+ * @param task The task_struct of the target task
+ * @param tld A pointer to a bpf_task_local_data object to be initialized
+ * @return -EINVAL if task local data is not initialized by the user space for this
+ * task; -ENOMEM if cached offset map creation fails. In both cases, users can abort,
+ * or continue without calling task local data APIs as if no key-value pairs are set.
+ */
+__attribute__((unused))
+static int bpf_tld_init(struct task_struct *task, struct bpf_task_local_data *tld)
+{
+ tld->data_map = bpf_task_storage_get(&task_local_data_map, task, 0, 0);
+ if (!tld->data_map)
+ return -EINVAL;
+
+ tld->off_map = bpf_task_storage_get(&task_local_data_off_map, task, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!tld->off_map)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/**
+ * @brief bpf_tld_init_var() looks up the metadata with a key and caches the offset of
+ * the value corresponding to the key.
+ *
+ * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
+ * @param key The key used to look up the task local data. Should be one of the
+ * symbols defined in struct task_local_data_offsets, not a string
+ */
+#define bpf_tld_init_var(tld, key) \
+ ({ \
+ (tld)->off_map->key = bpf_tld_fetch_off(tld, #key); \
+ })
+
+__attribute__((unused))
+static short bpf_tld_fetch_off(struct bpf_task_local_data *tld, const char *key)
+{
+ int i, umetadata_off, umetadata_cnt, udata_start;
+ void *umetadata, *key_i, *off_i;
+ short off = 0;
+
+ if (!tld->data_map || !tld->data_map->umetadata)
+ goto out;
+
+ udata_start = tld->data_map->udata_start;
+ umetadata_cnt = tld->data_map->umetadata_cnt;
+ umetadata = tld->data_map->umetadata->meta;
+
+ bpf_for(i, 0, umetadata_cnt) {
+ umetadata_off = i * sizeof(struct key_meta);
+ if (umetadata_off > PAGE_SIZE - sizeof(struct key_meta))
+ break;
+
+ key_i = umetadata + umetadata_off + offsetof(struct key_meta, key);
+ off_i = umetadata + umetadata_off + offsetof(struct key_meta, off);
+
+ if (!bpf_strncmp(key_i, TASK_LOCAL_DATA_KEY_LEN, key)) {
+ off = *(short *)(off_i) + udata_start;
+ if (off >= PAGE_SIZE)
+ off = (off - PAGE_SIZE) | PAGE_IDX_MASK;
+ /* Shift cached offset by 1 so that 0 means not initialized */
+ off += 1;
+ break;
+ }
+ }
+out:
+ return off;
+}
+
+/**
+ * @brief bpf_tld_lookup() looks up the task local data using the cached offset
+ * corresponding to the key.
+ *
+ * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
+ * @param key The key used to look up the task local data. Should be one of the
+ * symbols defined in struct task_local_data_offsets, not a string
+ * @param size The size of the value. Must be a known constant value
+ * @return A pointer to the value corresponding to the key; NULL if the offset
+ * is not cached or the size is too big
+ */
+#define bpf_tld_lookup(tld, key, size) __bpf_tld_lookup(tld, (tld)->off_map->key, size)
+__attribute__((unused))
+static void *__bpf_tld_lookup(struct bpf_task_local_data *tld, short cached_off, int size)
+{
+ short page_off, page_idx;
+
+ if (!cached_off--)
+ return NULL;
+
+ page_off = cached_off & ~PAGE_IDX_MASK;
+ page_idx = !!(cached_off & PAGE_IDX_MASK);
+
+ if (page_idx) {
+		return (tld->data_map->udata[1].page && page_off <= PAGE_SIZE - size) ?
+ (void *)tld->data_map->udata[1].page + page_off : NULL;
+ } else {
+		return (tld->data_map->udata[0].page && page_off <= PAGE_SIZE - size) ?
+ (void *)tld->data_map->udata[0].page + page_off : NULL;
+ }
+}
diff --git a/tools/testing/selftests/bpf/task_local_data_common.h b/tools/testing/selftests/bpf/task_local_data_common.h
new file mode 100644
index 000000000000..2a0bb724c77c
--- /dev/null
+++ b/tools/testing/selftests/bpf/task_local_data_common.h
@@ -0,0 +1,41 @@
+#ifndef __BPF_TASK_KV_STORE_COMMON_H__
+#define __BPF_TASK_KV_STORE_COMMON_H__
+
+#ifdef __BPF__
+struct data_page *dummy_data_page;
+struct meta_page *dummy_meta_page;
+#else
+#define __uptr
+#endif
+
+
+#define TASK_LOCAL_DATA_MAP_PIN_PATH "/sys/fs/bpf/task_local_data_map"
+#define TASK_LOCAL_DATA_KEY_LEN 62
+#define PAGE_SIZE 4096
+#define PAGE_MASK (~(PAGE_SIZE - 1))
+
+struct data_page {
+ char data[PAGE_SIZE];
+};
+
+struct data_page_entry {
+ struct data_page __uptr *page;
+};
+
+struct key_meta {
+ char key[TASK_LOCAL_DATA_KEY_LEN];
+ short off;
+};
+
+struct meta_page {
+ struct key_meta meta[64];
+};
+
+struct task_local_data_map_value {
+ struct data_page_entry udata[2];
+ struct meta_page __uptr *umetadata;
+ short umetadata_cnt;
+ short udata_start;
+};
+
+#endif
--
2.47.1
* [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of task local data
2025-04-25 21:40 [PATCH RFC v3 0/2] Task local data API Amery Hung
2025-04-25 21:40 ` [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data Amery Hung
@ 2025-04-25 21:40 ` Amery Hung
2025-04-25 22:12 ` Tejun Heo
2025-05-01 20:37 ` [PATCH RFC v3 0/2] Task local data API Andrii Nakryiko
2 siblings, 1 reply; 18+ messages in thread
From: Amery Hung @ 2025-04-25 21:40 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
ameryhung, kernel-team
Test the workflow of task local data. A user space program first declares
task local data using two different APIs. As the test starts, it calls
bpf_tld_thread_init() for every new thread that would access the
storage. Then, values can be accessed directly. The user space triggers
two bpf programs: prog_init and prog_main. prog_init simulates a
sched_ext_ops::init_task, which runs only once for every new task. It
caches the offsets of values of the task. prog_main represents bpf
programs for normal operation. It reads the task local data and writes the
results to global variables for verification.
The user space program launches 32 threads to make sure that not only
umetadata, but also the thread-specific udata and udata_start, are handled
correctly. This is verified by writing values in user space, reading
them in the bpf program, and checking that they match. The test also makes
sure the data are indeed thread-specific. Finally, a large task local data
is declared to check that the declaration API prevents it from spanning
two pages.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
.../bpf/prog_tests/test_task_local_data.c | 156 ++++++++++++++++++
.../bpf/progs/test_task_local_data_basic.c | 78 +++++++++
.../selftests/bpf/task_local_data_common.h | 8 +
3 files changed, 242 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_task_local_data.c b/tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
new file mode 100644
index 000000000000..5754687026f3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
@@ -0,0 +1,156 @@
+#include <pthread.h>
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+
+#include "task_local_data.h"
+#include "test_task_local_data_basic.skel.h"
+
+#define TEST_THREAD_NUM 32
+
+/* Used to declare a large task local data below to see if bpf_tld_type_var() prevents
+ * a value from crossing the page boundary
+ */
+struct dummy {
+ char data[1000];
+};
+
+/* Declare task local data */
+bpf_tld_type_var(int, value1);
+bpf_tld_type_var(struct test_struct, value2);
+bpf_tld_type_var(struct dummy, dummy);
+bpf_tld_key_type_var("test_basic_value3", int, value3);
+bpf_tld_key_type_var("test_basic_value4", struct test_struct, value4);
+
+/* Serialize access to bpf program's global variables */
+static pthread_mutex_t global_mutex;
+
+static void run_prog_init(struct test_task_local_data_basic *skel, int tid)
+{
+ skel->bss->target_tid = tid;
+ (void)syscall(__NR_getuid);
+ skel->bss->target_tid = -1;
+}
+
+static void run_prog_main(struct test_task_local_data_basic *skel, int tid)
+{
+ skel->bss->target_tid = tid;
+ (void)syscall(__NR_gettid);
+ skel->bss->target_tid = -1;
+}
+
+void *test_task_local_data_basic_thread(void *arg)
+{
+ struct test_task_local_data_basic *skel = (struct test_task_local_data_basic *)arg;
+ int err, tid;
+
+ tid = gettid();
+
+ err = bpf_tld_thread_init();
+ if (!ASSERT_OK(err, "bpf_tld_thread_init"))
+ return NULL;
+
+ value1 = tid + 0;
+ value2.a = tid + 1;
+ value2.b = tid + 2;
+ value2.c = tid + 3;
+ value2.d = tid + 4;
+ value3 = tid + 5;
+ value4.a = tid + 6;
+ value4.b = tid + 7;
+ value4.c = tid + 8;
+ value4.d = tid + 9;
+
+ pthread_mutex_lock(&global_mutex);
+ /* Simulate an initialization bpf prog that runs once for every new task.
+ * The program caches data offsets for subsequent bpf programs
+ */
+ run_prog_init(skel, tid);
+	/* Run main prog that looks up task local data and saves it to global variables */
+ run_prog_main(skel, tid);
+ ASSERT_EQ(skel->bss->test_value1, tid + 0, "bpf_tld_lookup value1");
+ ASSERT_EQ(skel->bss->test_value2.a, tid + 1, "bpf_tld_lookup value2.a");
+ ASSERT_EQ(skel->bss->test_value2.b, tid + 2, "bpf_tld_lookup value2.b");
+ ASSERT_EQ(skel->bss->test_value2.c, tid + 3, "bpf_tld_lookup value2.c");
+ ASSERT_EQ(skel->bss->test_value2.d, tid + 4, "bpf_tld_lookup value2.d");
+ ASSERT_EQ(skel->bss->test_value3, tid + 5, "bpf_tld_lookup value3");
+ ASSERT_EQ(skel->bss->test_value4.a, tid + 6, "bpf_tld_lookup value4.a");
+ ASSERT_EQ(skel->bss->test_value4.b, tid + 7, "bpf_tld_lookup value4.b");
+ ASSERT_EQ(skel->bss->test_value4.c, tid + 8, "bpf_tld_lookup value4.c");
+ ASSERT_EQ(skel->bss->test_value4.d, tid + 9, "bpf_tld_lookup value4.d");
+ pthread_mutex_unlock(&global_mutex);
+
+ /* Make sure valueX are indeed local to threads */
+ ASSERT_EQ(value1, tid + 0, "value1");
+ ASSERT_EQ(value2.a, tid + 1, "value2.a");
+ ASSERT_EQ(value2.b, tid + 2, "value2.b");
+ ASSERT_EQ(value2.c, tid + 3, "value2.c");
+ ASSERT_EQ(value2.d, tid + 4, "value2.d");
+ ASSERT_EQ(value3, tid + 5, "value3");
+ ASSERT_EQ(value4.a, tid + 6, "value4.a");
+ ASSERT_EQ(value4.b, tid + 7, "value4.b");
+ ASSERT_EQ(value4.c, tid + 8, "value4.c");
+ ASSERT_EQ(value4.d, tid + 9, "value4.d");
+
+ value1 = tid + 9;
+ value2.a = tid + 8;
+ value2.b = tid + 7;
+ value2.c = tid + 6;
+ value2.d = tid + 5;
+ value3 = tid + 4;
+ value4.a = tid + 3;
+ value4.b = tid + 2;
+ value4.c = tid + 1;
+ value4.d = tid + 0;
+
+ /* Run main prog again */
+ pthread_mutex_lock(&global_mutex);
+ run_prog_main(skel, tid);
+ ASSERT_EQ(skel->bss->test_value1, tid + 9, "bpf_tld_lookup value1");
+ ASSERT_EQ(skel->bss->test_value2.a, tid + 8, "bpf_tld_lookup value2.a");
+ ASSERT_EQ(skel->bss->test_value2.b, tid + 7, "bpf_tld_lookup value2.b");
+ ASSERT_EQ(skel->bss->test_value2.c, tid + 6, "bpf_tld_lookup value2.c");
+ ASSERT_EQ(skel->bss->test_value2.d, tid + 5, "bpf_tld_lookup value2.d");
+ ASSERT_EQ(skel->bss->test_value3, tid + 4, "bpf_tld_lookup value3");
+ ASSERT_EQ(skel->bss->test_value4.a, tid + 3, "bpf_tld_lookup value4.a");
+ ASSERT_EQ(skel->bss->test_value4.b, tid + 2, "bpf_tld_lookup value4.b");
+ ASSERT_EQ(skel->bss->test_value4.c, tid + 1, "bpf_tld_lookup value4.c");
+ ASSERT_EQ(skel->bss->test_value4.d, tid + 0, "bpf_tld_lookup value4.d");
+ pthread_mutex_unlock(&global_mutex);
+
+ pthread_exit(NULL);
+}
+
+static void test_task_local_data_basic(void)
+{
+ struct test_task_local_data_basic *skel;
+ pthread_t thread[TEST_THREAD_NUM];
+ int i, err;
+
+ ASSERT_OK(pthread_mutex_init(&global_mutex, NULL), "pthread_mutex_init");
+
+ skel = test_task_local_data_basic__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+ return;
+
+ err = test_task_local_data_basic__attach(skel);
+ if (!ASSERT_OK(err, "skel_attach"))
+ goto out;
+
+ for (i = 0; i < TEST_THREAD_NUM; i++) {
+ err = pthread_create(&thread[i], NULL, test_task_local_data_basic_thread, skel);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto out;
+ }
+
+ for (i = 0; i < TEST_THREAD_NUM; i++)
+ pthread_join(thread[i], NULL);
+out:
+	test_task_local_data_basic__destroy(skel);
+	unlink(TASK_LOCAL_DATA_MAP_PIN_PATH);
+}
+
+void test_task_local_data(void)
+{
+ if (test__start_subtest("task_local_data_basic"))
+ test_task_local_data_basic();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c b/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
new file mode 100644
index 000000000000..345d7c6e37de
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
@@ -0,0 +1,78 @@
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+#include "task_local_data.h"
+
+struct task_local_data_offsets {
+ short value1;
+ short value2;
+ short test_basic_value3;
+ short test_basic_value4;
+};
+
+pid_t target_tid = 0;
+int test_value1 = 0;
+struct test_struct test_value2;
+int test_value3 = 0;
+struct test_struct test_value4;
+
+SEC("tp/syscalls/sys_enter_getuid")
+int prog_init(void *ctx)
+{
+ struct bpf_task_local_data tld;
+ struct task_struct *task;
+ int err;
+
+ task = bpf_get_current_task_btf();
+ if (task->pid != target_tid)
+ return 0;
+
+ err = bpf_tld_init(task, &tld);
+ if (err)
+ return 0;
+
+ bpf_tld_init_var(&tld, value1);
+ bpf_tld_init_var(&tld, value2);
+ bpf_tld_init_var(&tld, test_basic_value3);
+ bpf_tld_init_var(&tld, test_basic_value4);
+
+ return 0;
+}
+
+SEC("tp/syscalls/sys_enter_gettid")
+int prog_main(void *ctx)
+{
+ struct bpf_task_local_data tld;
+ struct test_struct *struct_p;
+ struct task_struct *task;
+ int err, *int_p;
+
+ task = bpf_get_current_task_btf();
+ if (task->pid != target_tid)
+ return 0;
+
+ err = bpf_tld_init(task, &tld);
+ if (err)
+ return 0;
+
+ int_p = bpf_tld_lookup(&tld, value1, sizeof(int));
+ if (int_p)
+ test_value1 = *int_p;
+
+ struct_p = bpf_tld_lookup(&tld, value2, sizeof(struct test_struct));
+ if (struct_p)
+ test_value2 = *struct_p;
+
+ int_p = bpf_tld_lookup(&tld, test_basic_value3, sizeof(int));
+ if (int_p)
+ test_value3 = *int_p;
+
+ struct_p = bpf_tld_lookup(&tld, test_basic_value4, sizeof(struct test_struct));
+ if (struct_p)
+ test_value4 = *struct_p;
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+
diff --git a/tools/testing/selftests/bpf/task_local_data_common.h b/tools/testing/selftests/bpf/task_local_data_common.h
index 2a0bb724c77c..ad99c66d3305 100644
--- a/tools/testing/selftests/bpf/task_local_data_common.h
+++ b/tools/testing/selftests/bpf/task_local_data_common.h
@@ -38,4 +38,12 @@ struct task_local_data_map_value {
short udata_start;
};
+/* test specific */
+struct test_struct {
+ unsigned long a;
+ unsigned long b;
+ unsigned long c;
+ unsigned long d;
+};
+
#endif
--
2.47.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of task local data
2025-04-25 21:40 ` [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of " Amery Hung
@ 2025-04-25 22:12 ` Tejun Heo
2025-04-25 22:51 ` Amery Hung
0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2025-04-25 22:12 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
kernel-team
Hello,
On Fri, Apr 25, 2025 at 02:40:34PM -0700, Amery Hung wrote:
...
> +bpf_tld_key_type_var("test_basic_value3", int, value3);
> +bpf_tld_key_type_var("test_basic_value4", struct test_struct, value4);
I think it'd be fine to always require a key string.
> diff --git a/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c b/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> new file mode 100644
> index 000000000000..345d7c6e37de
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
...
> + bpf_tld_init_var(&tld, test_basic_value3);
> + bpf_tld_init_var(&tld, test_basic_value4);
Would it make more sense to make the second parameter a string? The
key names may contain tokens that are special to C and it becomes odd to
escape naked strings.
Thanks.
--
tejun
* Re: [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of task local data
2025-04-25 22:12 ` Tejun Heo
@ 2025-04-25 22:51 ` Amery Hung
0 siblings, 0 replies; 18+ messages in thread
From: Amery Hung @ 2025-04-25 22:51 UTC (permalink / raw)
To: Tejun Heo
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
kernel-team
On Fri, Apr 25, 2025 at 3:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Apr 25, 2025 at 02:40:34PM -0700, Amery Hung wrote:
> ...
> > +bpf_tld_key_type_var("test_basic_value3", int, value3);
> > +bpf_tld_key_type_var("test_basic_value4", struct test_struct, value4);
>
> I think it'd be fine to always require key string.
>
> > diff --git a/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c b/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> > new file mode 100644
> > index 000000000000..345d7c6e37de
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> ...
> > + bpf_tld_init_var(&tld, test_basic_value3);
> > + bpf_tld_init_var(&tld, test_basic_value4);
>
> Would it make more sense to make the second parameter a string? The
> key names may contain tokens that are special to C and it becomes odd to
> escape naked strings.
>
Totally makes sense. I will only keep bpf_tld_key_type_var() that
requires a key string in the declaration.
I will also add the key string as the third argument to
bpf_tld_init_var() and rename it to bpf_tld_fetch_key(). I don't think
the symbol type argument that refers to one of the members of struct
task_local_data_offsets can be removed.
So it becomes:
bpf_tld_fetch_key(&tld, test_basic_value3, "something.test_basic_value3");
Later, bpf programs still do:
bpf_tld_lookup(&tld, test_basic_value3);
> Thanks.
>
> --
> tejun
* Re: [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data
2025-04-25 21:40 ` [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data Amery Hung
@ 2025-04-30 1:44 ` Alexei Starovoitov
2025-05-02 15:00 ` Amery Hung
2025-05-01 20:38 ` Andrii Nakryiko
1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-30 1:44 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, Network Development, Andrii Nakryiko, Daniel Borkmann,
Tejun Heo, Martin KaFai Lau, Kernel Team
On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> Task local data provides simple and fast bpf and user space APIs to
> exchange per-task data through task local storage map. The user space
> first declares task local data using bpf_tld_type_key_var() or
> bpf_tld_type_var(). The data is a thread-specific variable which
> every thread has its own copy. Then, a bpf_tld_thread_init() needs to
> be called for every thread to share the data with bpf. Finally, users
> can directly read/write thread local data without bpf syscall.
>
> For bpf programs to see task local data, every data item needs to be
> initialized once for every new task using bpf_tld_init_var(). Then
> bpf programs can retrieve pointers to the data with bpf_tld_lookup().
>
> The core of task local storage implementation relies on UPTRs. They
> pin user pages to the kernel so that user space can share data with bpf
> programs without syscalls. Both data and the metadata used to access
> data are pinned via UPTRs.
>
> A limitation of UPTR makes the implementation of task local data
> less trivial than it sounds: memory pinned to UPTR cannot exceed a
> page and must not cross the page boundary. In addition, the data
> declaration uses the __thread identifier and therefore does not have
> direct control over the memory allocation. Therefore, several
> tricks and checks are used to make it work.
>
> First, task local data declaration APIs place data in a custom "udata"
> section so that data from different compilation units will be contiguous
> in memory and can be pinned using two UPTRs if they are smaller than
> one page.
>
> To prevent each data item from spanning two pages, each is aligned
> to the smallest power of two larger than its size.
>
> As we don't control the memory allocation for data, we need to figure
> out the layout of user defined data. This is done by the data
> declaration API and bpf_tld_thread_init(). The data declaration API
> will insert constructors for all data, and they are used to find the
> size and offset of data as well as the beginning and end of the whole
> udata section. Then, bpf_tld_thread_init() performs a per-thread check
> to make sure no data will cross the page boundary as udata can start at
> a different offset within a page.
>
> Note that for umetadata, we directly aligned_alloc() memory for it and
> assign it to the UPTR. This is only done once for every process as every
> task shares the same umetadata. The actual thread-specific data offset
> will be adjusted in the bpf program when calling bpf_tld_init_var().
>
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
> .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
> .../bpf/prog_tests/task_local_data.h | 58 ++++++
> .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
> .../selftests/bpf/task_local_data_common.h | 41 ++++
> 4 files changed, 439 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.c b/tools/testing/selftests/bpf/prog_tests/task_local_data.c
> new file mode 100644
> index 000000000000..5a21514573d2
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.c
> @@ -0,0 +1,159 @@
> +#include <fcntl.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <pthread.h>
> +
> +#include <bpf/bpf.h>
> +
> +#include "bpf_util.h"
> +#include "task_local_data.h"
> +#include "task_local_storage_helpers.h"
> +
> +#define PIDFD_THREAD O_EXCL
> +
> +/* To find the start of udata for each thread, insert a dummy variable to udata.
Pls use kernel comment style instead of networking.
> + * Contructors generated for every task local data will figured out the offset
constructors.
> + * from the beginning of udata to the dummy symbol. Then, every thread can infer
> + * the start of udata by subtracting the offset from the address of dummy.
> + */
Pls add the full algorithm involving udata_dummy here.
How it can be anywhere in the udata section...
then constructors will keep adjusting udata_start_dummy_off
and eventually km->off -= udata_start_dummy_off.
These steps are very tricky.
So big and detailed comments are necessary.
> +static __thread struct udata_dummy {} udata_dummy SEC("udata");
> +
> +static __thread bool task_local_data_thread_inited;
> +
> +struct task_local_data {
> + void *udata_start;
> + void *udata_end;
> + int udata_start_dummy_off;
> + struct meta_page *umetadata;
> + int umetadata_cnt;
> + bool umetadata_init;
> + short udata_sizes[64];
> + pthread_mutex_t lock;
> +} task_local_data = {
> + .udata_start = (void *)-1UL,
> + .lock = PTHREAD_MUTEX_INITIALIZER,
> +};
> +
> +static void tld_set_data_key_meta(int i, const char *key, short off)
> +{
> + task_local_data.umetadata->meta[i].off = off;
> + strncpy(task_local_data.umetadata->meta[i].key, key, TASK_LOCAL_DATA_KEY_LEN);
> +}
> +
> +static struct key_meta *tld_get_data_key_meta(int i)
> +{
> + return &task_local_data.umetadata->meta[i];
> +}
> +
> +static void tld_set_data_size(int i, int size)
> +{
> + task_local_data.udata_sizes[i] = size;
> +}
> +
> +static int tld_get_data_size(int i)
> +{
> + return task_local_data.udata_sizes[i];
> +}
The above 4 helpers are single use.
If nothing else is using them, open code them directly.
Otherwise it only makes it harder to understand the logic.
> +
> +void __bpf_tld_var_init(const char *key, void *var, int size)
> +{
> + int i;
> +
> + i = task_local_data.umetadata_cnt++;
> +
> + if (!task_local_data.umetadata) {
> + if (task_local_data.umetadata_cnt > 1)
> + return;
> +
> + task_local_data.umetadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
> + if (!task_local_data.umetadata)
> + return;
> + }
> +
> + if (var < task_local_data.udata_start) {
> + task_local_data.udata_start = var;
> + task_local_data.udata_start_dummy_off =
> + (void *)&udata_dummy - task_local_data.udata_start;
> + }
> +
> + if (var + size > task_local_data.udata_end)
> + task_local_data.udata_end = var + size;
> +
> + tld_set_data_key_meta(i, key, var - (void *)&udata_dummy);
> + tld_set_data_size(i, size);
> +}
> +
> +int bpf_tld_thread_init(void)
> +{
> + unsigned long udata_size, udata_start, udata_start_page, udata_end_page;
> + struct task_local_data_map_value map_val;
> + int i, task_id, task_fd, map_fd, err;
> +
> + if (!task_local_data.umetadata_cnt || task_local_data_thread_inited)
> + return 0;
> +
> + if (task_local_data.umetadata_cnt && !task_local_data.umetadata)
> + return -ENOMEM;
> +
> + udata_start = (unsigned long)&udata_dummy + task_local_data.udata_start_dummy_off;
> +
> + pthread_mutex_lock(&task_local_data.lock);
can we drop this?
If .c is part of .h just this line will drag -lpthread dependency.
I think it's an artifact of the selftest.
The selftest/library/application can have its own mutex to protect
or this function can use simple xchg() like exclusion
without mutexes.
> + for (i = 0; i < task_local_data.umetadata_cnt; i++) {
> + struct key_meta *km = tld_get_data_key_meta(i);
> + int size = tld_get_data_size(i);
> + int off;
> +
> + if (!task_local_data.umetadata_init) {
> + /* Constructors save the offset from udata_dummy to each data
> + * Now as all ctors have run and the offset between the start of
> + * udata and udata_dummy is known, adjust the offsets of data
> + * to be relative to the start of udata
> + */
> + km->off -= task_local_data.udata_start_dummy_off;
> +
> + /* Data exceeding a page may not be able to be covered by
> + * two udata UPTRs in every thread
> + */
> + if (km->off >= PAGE_SIZE)
> + return -EOPNOTSUPP;
returns without releasing the mutex...
One more reason to avoid it.
> + }
> +
> + /* A task local data should not span across two pages. */
> + off = km->off + udata_start;
> + if ((off & PAGE_MASK) != ((off + size - 1) & PAGE_MASK))
> + return -EOPNOTSUPP;
> + }
> + task_local_data.umetadata_init = true;
> + pthread_mutex_unlock(&task_local_data.lock);
> +
> + udata_size = task_local_data.udata_end - task_local_data.udata_start;
> + udata_start_page = udata_start & PAGE_MASK;
> + udata_end_page = (udata_start + udata_size) & PAGE_MASK;
> +
> + /* The whole udata can span across two pages for a thread. Use two UPTRs
> + * to cover the second page in case it happens.
> + */
> + map_val.udata_start = udata_start & ~PAGE_MASK;
> + map_val.udata[0].page = (struct data_page *)(udata_start_page);
> + map_val.udata[1].page = (udata_start_page == udata_end_page) ? NULL :
> + (struct data_page *)(udata_start_page + PAGE_SIZE);
> +
> + /* umetadata is shared by all threads under the assumption that all
> + * task local data are defined statically and linked together
> + */
> + map_val.umetadata = task_local_data.umetadata;
> + map_val.umetadata_cnt = task_local_data.umetadata_cnt;
> +
> + map_fd = bpf_obj_get(TASK_LOCAL_DATA_MAP_PIN_PATH);
> + if (map_fd < 0)
> + return -errno;
> +
> + task_id = sys_gettid();
> + task_fd = sys_pidfd_open(task_id, PIDFD_THREAD);
> + err = bpf_map_update_elem(map_fd, &task_fd, &map_val, 0);
> + if (err)
> + return err;
> +
> + task_local_data_thread_inited = true;
> + return 0;
> +}
> diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.h b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
> new file mode 100644
> index 000000000000..c928e8d2c0a6
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
> @@ -0,0 +1,58 @@
> +#ifndef __BPF_TASK_LOCAL_DATA_H__
> +#define __BPF_TASK_LOCAL_DATA_H__
> +
> +#include "task_local_data_common.h"
> +
> +#define SEC(name) __attribute__((section(name), used))
> +#define __aligned(x) __attribute__((aligned(x)))
> +
> +#define ROUND_UP_POWER_OF_TWO(x) (1UL << (sizeof(x) * 8 - __builtin_clzl(x - 1)))
> +
> +void __bpf_tld_var_init(const char *key, void *var, int size);
If possible, let's try to put everything into .h, so it's easier
to distribute. Otherwise extra .c becomes a headache.
> +
> +/**
> + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
> + * programs. The data will be a thread-specific variable which a user space
> + * program can directly read/write, while bpf programs will need to lookup
> + * with the string key.
> + *
> + * @param key The string key a task local data will be associated with. The
> + * string will be truncated if the length exceeds TASK_LOCAL_DATA_KEY_LEN
> + * @param type The type of the task local data
> + * @param var The name of the task local data
> + */
> +#define bpf_tld_key_type_var(key, type, var) \
> +__thread type var SEC("udata") __aligned(ROUND_UP_POWER_OF_TWO(sizeof(type))); \
> + \
> +__attribute__((constructor)) \
> +void __bpf_tld_##var##_init(void) \
> +{ \
> + _Static_assert(sizeof(type) < PAGE_SIZE, \
> + "data size must not exceed a page"); \
> + __bpf_tld_var_init(key, &var, sizeof(type)); \
> +}
> +
> +/**
> + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
> + * programs. The data will be a thread-specific variable which a user space
> + * program can directly read/write, while bpf programs will need to lookup
> + * the data with the string key same as the variable name.
> + *
> + * @param type The type of the task local data
> + * @param var The name of the task local data as well as the name of the
> + * key. The key string will be truncated if the length exceeds
> + * TASK_LOCAL_DATA_KEY_LEN.
> + */
> +#define bpf_tld_type_var(type, var) \
> + bpf_tld_key_type_var(#var, type, var)
Hiding string obfuscates it too much.
This API doesn't have analogous APIs either in bpf or user space.
So let's make everything explicit.
In this case bpf_tld_key_type_var()-like should be the only api
to declare a variable.
I would call it bpf_tld_var().
> +
> +/**
> + * @brief bpf_tld_thread_init() initializes the task local data for the current
> + * thread. All data are undefined from a bpf program's point of view until
> + * bpf_tld_thread_init() is called.
> + *
> + * @return 0 on success; negative error number on failure
> + */
> +int bpf_tld_thread_init(void);
> +
> +#endif
> diff --git a/tools/testing/selftests/bpf/progs/task_local_data.h b/tools/testing/selftests/bpf/progs/task_local_data.h
> new file mode 100644
> index 000000000000..7358993ee634
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_data.h
> @@ -0,0 +1,181 @@
> +#include <errno.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#include "task_local_data_common.h"
> +
> +#define PAGE_IDX_MASK 0x8000
> +
> +/* Overview
> + *
> + * Task local data (TLD) allows sharing per-task information between users and
> + * bpf programs without explicit storage layout management. TLD APIs use
> + * string keys to access data. Internally, TLD shares user pages through two
> + * UPTRs in a task local storage: udata and umetadata. Data are stored in udata.
> + * Keys and the offsets of the corresponding data in udata are stored in umetadata.
> + *
> + * Usage
> + *
> + * Users should initialize every task local data once for every new task before
> + * using them with bpf_tld_init_var(). It should be done ideally in non-critical
> + * paths first (e.g., sched_ext_ops::init_task) as it compares key strings and
> + * caches the offsets of data.
> + *
> + * First, users should define struct task_local_data_offsets, which will be used
> + * to cache the offsets of task local data. Each member of the struct should
> + * be a short integer with the same name as the key defined in the user space.
> + * Another task local storage map will be created to save the offsets. For example:
> + *
> + * struct task_local_data_offsets {
> + * short priority;
> + * short in_critical_section;
> + * };
The use of 'short' is unusual.
The kernel and bpf progs always use either u16 or s16.
> + *
> + * Task local data APIs take a pointer to bpf_task_local_data object as the first
> + * argument. The object should be declared as a stack variable and initialized via
> + * bpf_tld_init(). Then, in a bpf program, to cache the offset for a key-value pair,
> + * call bpf_tld_init_var(). For example, in init_task program:
> + *
> + * struct bpf_task_local_data tld;
> + *
> + * err = bpf_tld_init(task, &tld);
> + * if (err)
> + * return 0;
> + *
> + * bpf_tld_init_var(&tld, priority);
> + * bpf_tld_init_var(&tld, in_critical_section);
> + *
> + * Subsequently and in other bpf programs, to lookup task local data, call
> + * bpf_tld_lookup(). For example:
> + *
> + * int *p;
> + *
> + * p = bpf_tld_lookup(&tld, priority, sizeof(int));
> + * if (p)
> + * // do something depending on *p
> + */
> +
> +struct task_local_data_offsets;
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __type(key, int);
> + __type(value, struct task_local_data_map_value);
> + __uint(pinning, LIBBPF_PIN_BY_NAME);
> +} task_local_data_map SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __type(key, int);
> + __type(value, struct task_local_data_offsets);
> +} task_local_data_off_map SEC(".maps");
> +
> +struct bpf_task_local_data {
> + struct task_local_data_map_value *data_map;
> + struct task_local_data_offsets *off_map;
> +};
> +
> +/**
> + * @brief bpf_tld_init() initializes a bpf_task_local_data object.
> + *
> + * @param task The task_struct of the target task
> + * @param tld A pointer to a bpf_task_local_data object to be initialized
> + * @return -EINVAL if task KV store is not initialized by the user space for this task;
> + * -ENOMEM if cached offset map creation fails. In both cases, users can abort, or
> + * conitnue without calling task KV store APIs as if no key-value pairs are set.
continue
> + */
> +__attribute__((unused))
> +static int bpf_tld_init(struct task_struct *task, struct bpf_task_local_data *tld)
> +{
> + tld->data_map = bpf_task_storage_get(&task_local_data_map, task, 0, 0);
> + if (!tld->data_map)
> + return -EINVAL;
> +
> + tld->off_map = bpf_task_storage_get(&task_local_data_off_map, task, 0,
> + BPF_LOCAL_STORAGE_GET_F_CREATE);
> + if (!tld->off_map)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +/**
> + * @brief bpf_tld_init_var() looks up the metadata with a key and caches the offset of
> + * the value corresponding to the key.
> + *
> + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
> + * @param key The key used to lookup the task KV store. Should be one of the
> + * symbols defined in struct task_local_data_offsets, not a string
> + */
> +#define bpf_tld_init_var(tld, key) \
> + ({ \
> + (tld)->off_map->key = bpf_tld_fetch_off(tld, #key); \
> + })
> +
> +__attribute__((unused))
> +static short bpf_tld_fetch_off(struct bpf_task_local_data *tld, const char *key)
> +{
> + int i, umetadata_off, umetadata_cnt, udata_start;
> + void *umetadata, *key_i, *off_i;
> + short off = 0;
> +
> + if (!tld->data_map || !tld->data_map->umetadata)
> + goto out;
> +
> + udata_start = tld->data_map->udata_start;
> + umetadata_cnt = tld->data_map->umetadata_cnt;
> + umetadata = tld->data_map->umetadata->meta;
> +
> + bpf_for(i, 0, umetadata_cnt) {
> + umetadata_off = i * sizeof(struct key_meta);
> + if (umetadata_off > PAGE_SIZE - sizeof(struct key_meta))
> + break;
> +
> + key_i = umetadata + umetadata_off + offsetof(struct key_meta, key);
> + off_i = umetadata + umetadata_off + offsetof(struct key_meta, off);
> +
> + if (!bpf_strncmp(key_i, TASK_LOCAL_DATA_KEY_LEN, key)) {
> + off = *(short *)(off_i) + udata_start;
> + if (off >= PAGE_SIZE)
> + off = (off - PAGE_SIZE) | PAGE_IDX_MASK;
> + /* Shift cached offset by 1 so that 0 means not initialized */
> + off += 1;
> + break;
> + }
> + }
> +out:
> + return off;
> +}
> +
> +/**
> + * @brief bpf_tld_lookup() looks up the task KV store using the cached offset
> + * corresponding to the key.
> + *
> + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
> + * @param key The key used to lookup the task KV store. Should be one of the
> + * symbols defined in struct task_local_data_offsets, not a string
> + * @param size The size of the value. Must be a known constant value
> + * @return A pointer to the value corresponding to the key; NULL if the offset
> + * is not cached or the size is too big
> + */
> +#define bpf_tld_lookup(tld, key, size) __bpf_tld_lookup(tld, (tld)->off_map->key, size)
> +__attribute__((unused))
> +static void *__bpf_tld_lookup(struct bpf_task_local_data *tld, short cached_off, int size)
> +{
> + short page_off, page_idx;
> +
> + if (!cached_off--)
> + return NULL;
> +
> + page_off = cached_off & ~PAGE_IDX_MASK;
> + page_idx = !!(cached_off & PAGE_IDX_MASK);
> +
> + if (page_idx) {
> + return (tld->data_map->udata[1].page && page_off < PAGE_SIZE - size) ?
> + (void *)tld->data_map->udata[1].page + page_off : NULL;
> + } else {
> + return (tld->data_map->udata[0].page && page_off < PAGE_SIZE - size) ?
> + (void *)tld->data_map->udata[0].page + page_off : NULL;
> + }
> +}
> diff --git a/tools/testing/selftests/bpf/task_local_data_common.h b/tools/testing/selftests/bpf/task_local_data_common.h
> new file mode 100644
> index 000000000000..2a0bb724c77c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/task_local_data_common.h
> @@ -0,0 +1,41 @@
> +#ifndef __BPF_TASK_KV_STORE_COMMON_H__
> +#define __BPF_TASK_KV_STORE_COMMON_H__
> +
> +#ifdef __BPF__
> +struct data_page *dummy_data_page;
> +struct meta_page *dummy_meta_page;
> +#else
> +#define __uptr
> +#endif
> +
> +
> +#define TASK_LOCAL_DATA_MAP_PIN_PATH "/sys/fs/bpf/task_local_data_map"
> +#define TASK_LOCAL_DATA_KEY_LEN 62
> +#define PAGE_SIZE 4096
We have
enum page_size_enum {
__PAGE_SIZE = PAGE_SIZE
};
inside kernel/bpf/core.c
and bpf progs that include vmlinux.h can use it directly as __PAGE_SIZE.
Let's think through upfront how the whole thing works on
architectures with different page size.
> +#define PAGE_MASK (~(PAGE_SIZE - 1))
> +
> +struct data_page {
> + char data[PAGE_SIZE];
> +};
> +
> +struct data_page_entry {
> + struct data_page __uptr *page;
> +};
> +
> +struct key_meta {
> + char key[TASK_LOCAL_DATA_KEY_LEN];
> + short off;
> +};
> +
> +struct meta_page {
> + struct key_meta meta[64];
> +};
> +
> +struct task_local_data_map_value {
> + struct data_page_entry udata[2];
> + struct meta_page __uptr *umetadata;
> + short umetadata_cnt;
> + short udata_start;
> +};
> +
> +#endif
> --
> 2.47.1
>
* Re: [PATCH RFC v3 0/2] Task local data API
2025-04-25 21:40 [PATCH RFC v3 0/2] Task local data API Amery Hung
2025-04-25 21:40 ` [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data Amery Hung
2025-04-25 21:40 ` [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of " Amery Hung
@ 2025-05-01 20:37 ` Andrii Nakryiko
2025-05-01 23:26 ` Alexei Starovoitov
2025-05-02 4:26 ` Amery Hung
2 siblings, 2 replies; 18+ messages in thread
From: Andrii Nakryiko @ 2025-05-01 20:37 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
kernel-team
On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> Hi,
>
> This is a respin of the uptr KV store. It is renamed to task local data (TLD)
> as the problem statement and the solution have changed, and it now draws
> more similarities to pthread thread-specific data.
>
> * Overview *
>
> This patchset is a continuation of the original UPTR work[0], which aims
> to provide a fast way for user space programs to pass per-task hints to
> sched_ext schedulers. UPTR built the foundation by supporting sharing
> user pages with bpf programs through task local storage maps.
>
> Additionally, sched_ext would like to allow multiple developers to share
> a storage without the need to explicitly agree on its layout.
> This simplifies code base management and makes experimenting easier.
> While a centralized storage layout definition would have worked, the
> friction of synchronizing it across different repos is not desirable.
>
> This patchset contains the user space plumbing so that user space and bpf
> program developers can exchange per-task hints easily through simple
> interfaces.
>
> * Design *
>
> BPF task local data is a simple API for sharing task-specific data
> between user space and bpf programs, where data are referred to using
> string keys. As shown in the following figure, user space programs can
> define a task local data using bpf_tld_type_var(). The data is
> effectively a variable declared with __thread, of which every thread owns an
> independent copy that can be accessed directly. On the bpf side, a task
> local data first needs to be initialized for every new task once (e.g.,
> in sched_ext_ops::init_task) using bpf_tld_init_var(). Then, other bpf
> programs can get a pointer to the data using bpf_tld_lookup(). The task
> local data APIs refer to data using string keys so developers
> do not need to deal with addresses of data in a shared storage.
>
> ┌─ Application ─────────────────────────────────────────┐
> │ ┌─ library A ──────────────┐ │
> │ bpf_tld_type_var(int, X) │ bpf_tld_type_var(int, Y) │ │
> │ └┬─────────────────────────┘ │
> └───────┬───────────────────│───────────────────────────┘
> │ X = 123; │ Y = true;
bpf_tld_type_var() is *defining* a variable (i.e., it allocates storage
for it), right? I think for completeness we also need a *declare* macro
to access that variable from other compilation units
> V V
> + ─ Task local data ─ ─ ─ ─ ─ ─ +
> | ┌─ task_kvs_map ────────────┐ | ┌─ sched_ext_ops::init_task ──────┐
> | │ BPF Task local storage │ | │ bpf_tld_init_var(&kvs, X); │
> | │ ┌───────────────────┐ │ |<─┤ bpf_tld_init_var(&kvs, Y); │
> | │ │ __uptr *udata │ │ | └─────────────────────────────────┘
> | │ └───────────────────┘ │ |
> | │ ┌───────────────────┐ │ | ┌─ Other sched_ext_ops op ────────┐
> | │ │ __uptr *umetadata │ │ | │ int *y; ├┐
> | │ └───────────────────┘ │ |<─┤ y = bpf_tld_lookup(&kvs, Y, 1); ││
> | └───────────────────────────┘ | │ if (y) ││
> | ┌─ task_kvs_off_map ────────┐ | │ /* do something */ ││
> | │ BPF Task local storage │ | └┬────────────────────────────────┘│
> | └───────────────────────────┘ | └─────────────────────────────────┘
> + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ +
>
> * Implementation *
>
> Task local data API hides the memory management from the developers.
> Internally, it shares user data with bpf programs through udata UPTRs.
> Task local data from different compilation units are placed into a
> custom "udata" section by the declaration API, bpf_tld_type_var(), so
> that they are placed together in memory. User space will need to
> call bpf_tld_thread_init() for every new thread to pin udata pages to
> the kernel.
>
> The metadata used to address udata is stored in umetadata UPTR. It is
> generated by constructors inserted by bpf_tld_type_var() and
> bpf_tld_thread_init(). umetadata is an array of 64 metadata corresponding
> to each data, which contains the key and the offset of data in udata.
> During initialization, bpf_tld_init_var() will search umetadata for
> a matching key and cache its offset in task_kvs_off_map. Later,
> bpf_tld_lookup() will use the cached offset to retrieve a pointer to
> udata.
>
> * Limitation *
>
> Currently, it is assumed all key-value pairs are known when a program
> starts. All compilation units using task local data should be statically
> linked together so that values are all placed together in a udata section
> and therefore can be shared with bpf through two UPTRs. The next
Lack of support for shared libraries is a big limitation, IMO, so I
think we should design for that support from the very beginning.
FWIW, I think these compile-time static __thread variable definitions
are unnecessarily limiting and add a non-trivial amount of complexity.
I think it's more reasonable to go with a more dynamic approach, and
I'll try to outline what an API (and some implementation details)
might look like.
Note, all this is related to the user-space side of things. BPF-side
is unchanged, effectively, except you get a guarantee that all the
data will definitely be page-aligned, so you won't need the
two-uptr page handling.
First, data structures. You'll have one per-process metadata structure
describing known keys and their assigned "offsets":
struct tld_metadata {
int cnt;
char keys[MAX_KEY_CNT];
__u16 offs[MAX_KEY_CNT];
__u16 szs[MAX_KEY_CNT];
};
Now, each thread will basically have just a blob of data, so,
technically, per-thread we will just have:
struct tld_data {
__u8 data[PAGE_SIZE];
};
By pre-allocating the entire page we avoid lots of complications, so I
think it's worth doing.
Now, we really need just two APIs on user-space (and I'll use the
"opaque offset" type as I suggested on another patch):
typedef struct { int off; } tld_off_t;
tld_off_t tld_add_key_def(const char *key_name, size_t key_size);
This API can be called just once per each key that process cares
about. And this can be done at any point, really, very dynamically.
The implementation will:
- (just once per process) open pinned BPF map, remember its FD;
- (just once) allocate struct tld_metadata, unless we define it as
pre-allocated global variable;
- (locklessly) check if key_name is already in tld_metadata, if yes
- return already assigned offset;
- (locklessly) if not, add this key and assign it offset that is
offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
values (though we should take care of alignment requirements, of
course);
- return newly assigned offset;
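To make this concrete, here is a minimal, single-threaded user-space sketch of tld_add_key_def(). The MAX_KEY_CNT/MAX_KEY_LEN limits, the blanket 8-byte alignment, and the omission of the pinned-map plumbing are all simplifications of this sketch, not part of the proposal:

```c
#include <stddef.h>
#include <string.h>

#define MAX_KEY_CNT 64
#define MAX_KEY_LEN 16

typedef struct { int off; } tld_off_t;

struct tld_metadata {
	int cnt;
	char keys[MAX_KEY_CNT][MAX_KEY_LEN];
	unsigned short offs[MAX_KEY_CNT];
	unsigned short szs[MAX_KEY_CNT];
};

static struct tld_metadata tld_meta;	/* per-process metadata */

/* round v up to a multiple of a (a must be a power of two) */
static size_t round_up(size_t v, size_t a)
{
	return (v + a - 1) & ~(a - 1);
}

tld_off_t tld_add_key_def(const char *key_name, size_t key_size)
{
	size_t off = 0;
	int i;

	/* already known? return the previously assigned offset */
	for (i = 0; i < tld_meta.cnt; i++)
		if (!strcmp(tld_meta.keys[i], key_name))
			return (tld_off_t){ tld_meta.offs[i] };

	if (i == MAX_KEY_CNT)
		return (tld_off_t){ -1 };

	/* tightly pack after the previous value; a blanket 8-byte
	 * alignment stands in for real per-type alignment handling */
	if (i > 0)
		off = round_up(tld_meta.offs[i - 1] + tld_meta.szs[i - 1], 8);

	tld_meta.keys[i][0] = '\0';
	strncat(tld_meta.keys[i], key_name, MAX_KEY_LEN - 1);
	tld_meta.offs[i] = (unsigned short)off;
	tld_meta.szs[i] = (unsigned short)key_size;
	tld_meta.cnt++;

	return (tld_off_t){ (int)off };
}
```

The real version additionally has to synchronize concurrent callers and open the pinned BPF map once per process.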
Now, the second essential API is called for each participating thread
for each different key. And again, this is all very dynamic. It's
possible that some threads won't use any of this TLD stuff, in which
case there will be no overhead (memory or CPU), and not even an entry
in task local storage map for that thread. So, API:
void *tld_resolve_key(tld_off_t key_off);
This API will:
- (just once per thread, which is done trivially by using
__thread-local global variable to keep track of this) allocate struct
tld_data dynamically (with page alignment, aligned_alloc(PAGE_SIZE,
PAGE_SIZE))
- (just once per thread as well) bpf_map_update_elem() for current
thread, updating two uptrs: one pointing to global tld_metadata,
another pointing to thread-local tld_data;
- return tld_data->data + key_off.off;
That is, this API returns an absolute memory address of a value
resolved in the context of the current thread.
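A sketch of tld_resolve_key() along the same lines; the bpf_map_update_elem() of the two uptrs is elided here, leaving only the lazy per-thread page allocation:

```c
#include <stdlib.h>

#define PAGE_SIZE 4096

typedef struct { int off; } tld_off_t;

static __thread unsigned char *tld_data;	/* this thread's value blob */

void *tld_resolve_key(tld_off_t key_off)
{
	if (key_off.off < 0 || key_off.off >= PAGE_SIZE)
		return NULL;

	/* first use in this thread: allocate the page-aligned data blob;
	 * the real API would also bpf_map_update_elem() the two uptrs
	 * (global tld_metadata and thread-local tld_data) here */
	if (!tld_data) {
		tld_data = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
		if (!tld_data)
			return NULL;
	}

	/* absolute address of the value in the current thread's context */
	return tld_data + key_off.off;
}
```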
And let's look at how one can make use of this on the user-space side
to optimally use this API.
/* global variables */
tld_off_t my_key1_off; /* struct my_val */
tld_off_t my_key2_off; /* int */
__thread struct my_val *my_val1;
__thread int *my_val2;
... somewhere in constructor ...
my_key1_off = tld_add_key_def("my_key1", sizeof(struct my_val));
my_key2_off = tld_add_key_def("my_key2", sizeof(int));
... and then somewhere in the code that makes use of TLD stuff ...
if (!my_val1) /* this can be initialized once per thread to avoid this
check (or hidden in a helper accessor function) */
my_val1 = tld_resolve_key(my_key1_off);
my_val1->blah_field = 123;
if (!my_val2)
my_val2 = tld_resolve_key(my_key2_off);
*my_val2 = 567;
That's pretty much it, I think.
In C++ code base, it should be possible to make this even more
convenient by using a templated wrapper with thread-local inner
variable with its own constructor. Adding operator overloading (e.g.,
operator= and operator->) you get a very naturally looking definition
and access patterns:
/* global data */
tld_variable<struct my_val> my_val1("my_key1");
tld_variable<int> my_val2("my_key2");
... now in the actual TLD-using code, we just do:
my_val1->blah_field = 123;
my_val2 = 567; /* I think C++ would allow this with operator overloading */
I hope the example explains why it's still fast despite everything
being dynamic. There is a pointer indirection and that page-sized
allocation (so that we can cache thread-specific resolved pointers),
but it seems absolutely fine from performance perspective.
> iteration will explore how bpf task local data can work in dynamic
> libraries. Maybe more udata UPTRs will be added to pin pages of the TLS
> of dynamically loaded modules. Or maybe it will allocate memory for data
> instead of relying on __thread, and change how user space interacts with
> task local data slightly. The latter approach can also save some trouble
> dealing with the restrictions of UPTR.
>
> Some other limitations:
> - Total task local data cannot exceed a page
> - Only support 64 task local data
> - Some memory waste for data whose size is not power of two
> due to UPTR limitation
>
> [0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
>
>
> Amery Hung (2):
> selftests/bpf: Introduce task local data
> selftests/bpf: Test basic workflow of task local data
>
> .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
> .../bpf/prog_tests/task_local_data.h | 58 ++++++
> .../bpf/prog_tests/test_task_local_data.c | 156 +++++++++++++++
> .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
> .../bpf/progs/test_task_local_data_basic.c | 78 ++++++++
> .../selftests/bpf/task_local_data_common.h | 49 +++++
> 6 files changed, 681 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
>
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data
2025-04-25 21:40 ` [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data Amery Hung
2025-04-30 1:44 ` Alexei Starovoitov
@ 2025-05-01 20:38 ` Andrii Nakryiko
1 sibling, 0 replies; 18+ messages in thread
From: Andrii Nakryiko @ 2025-05-01 20:38 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
kernel-team
On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> Task local data provides simple and fast bpf and user space APIs to
> exchange per-task data through task local storage map. The user space
> first declares task local data using bpf_tld_type_key_var() or
> bpf_tld_type_var(). The data is a thread-specific variable of which
> every thread has its own copy. Then, bpf_tld_thread_init() needs to
> be called for every thread to share the data with bpf. Finally, users
> can directly read/write thread local data without bpf syscall.
>
> For bpf programs to see task local data, every data item needs to be
> initialized once for every new task using bpf_tld_init_var(). Then
> bpf programs can retrieve pointers to the data with bpf_tld_lookup().
>
> The core of task local storage implementation relies on UPTRs. They
> pin user pages to the kernel so that user space can share data with bpf
> programs without syscalls. Both data and the metadata used to access
> data are pinned via UPTRs.
>
> A limitation of UPTR makes the implementation of task local data
> less trivial than it sounds: memory pinned to UPTR cannot exceed a
> page and must not cross the page boundary. In addition, the data
> declaration uses the __thread specifier and therefore does not have
> direct control over the memory allocation. Therefore, several
> tricks and checks are used to make it work.
>
> First, task local data declaration APIs place data in a custom "udata"
> section so that data from different compilation units will be contiguous
> in the memory and can be pinned using two UPTRs if they are smaller than
> one page.
>
> To keep each data item from spanning two pages, each is aligned
> to the smallest power of two no smaller than its size.
>
> As we don't control the memory allocation for data, we need to figure
> out the layout of user defined data. This is done by the data
> declaration API and bpf_tld_thread_init(). The data declaration API
> will insert constructors for all data, and they are used to find the
> size and offset of data as well as the beginning and end of the whole
> udata section. Then, bpf_tld_thread_init() performs a per-thread check
> to make sure no data will cross the page boundary as udata can start at
> different offset in a page.
>
> Note that for umetadata, we directly aligned_alloc() memory for it and
> assign it to the UPTR. This is only done once per process as all
> tasks share the same umetadata. The actual thread-specific data offset
> will be adjusted in the bpf program when calling bpf_tld_init_var().
>
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
> .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
> .../bpf/prog_tests/task_local_data.h | 58 ++++++
> .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
> .../selftests/bpf/task_local_data_common.h | 41 ++++
> 4 files changed, 439 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
>
[...]
> +/**
> + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
> + * programs. The data will be a thread-specific variable which a user space
> + * program can directly read/write, while bpf programs will need to lookup
> + * with the string key.
> + *
> + * @param key The string key a task local data will be associated with. The
> + * string will be truncated if the length exceeds TASK_LOCAL_DATA_KEY_LEN
> + * @param type The type of the task local data
> + * @param var The name of the task local data
> + */
> +#define bpf_tld_key_type_var(key, type, var) \
> +__thread type var SEC("udata") __aligned(ROUND_UP_POWER_OF_TWO(sizeof(type))); \
> + \
> +__attribute__((constructor)) \
> +void __bpf_tld_##var##_init(void) \
> +{ \
> + _Static_assert(sizeof(type) < PAGE_SIZE, \
> + "data size must not exceed a page"); \
> + __bpf_tld_var_init(key, &var, sizeof(type)); \
> +}
> +
> +/**
> + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
declares -> defines, declaration would involve extern and no storage
allocated (it's only to reference something *defined* in another
compilation unit)
> + * programs. The data will be a thread-specific variable which a user space
> + * program can directly read/write, while bpf programs will need to lookup
> + * the data with the string key same as the variable name.
> + *
> + * @param type The type of the task local data
> + * @param var The name of the task local data as well as the name of the
> + * key. The key string will be truncated if the length exceeds
> + * TASK_LOCAL_DATA_KEY_LEN.
> + */
> +#define bpf_tld_type_var(type, var) \
> + bpf_tld_key_type_var(#var, type, var)
> +
> +/**
> + * @brief bpf_tld_thread_init() initializes the task local data for the current
> + * thread. All data are undefined from a bpf program's point of view until
> + * bpf_tld_thread_init() is called.
> + *
> + * @return 0 on success; negative error number on failure
> + */
> +int bpf_tld_thread_init(void);
> +
> +#endif
> diff --git a/tools/testing/selftests/bpf/progs/task_local_data.h b/tools/testing/selftests/bpf/progs/task_local_data.h
> new file mode 100644
> index 000000000000..7358993ee634
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_data.h
given this is BPF-only header, I'd make it more explicit by using
.bpf.h suffix, just like libbpf's usdt.bpf.h
[...]
> +/**
> + * @brief bpf_tld_lookup() looks up the task KV store using the cached offset
> + * corresponding to the key.
> + *
> + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
> + * @param key The key used to lookup the task KV store. Should be one of the
> + * symbols defined in struct task_local_data_offsets, not a string
> + * @param size The size of the value. Must be a known constant value
> + * @return A pointer to the value corresponding to the key; NULL if the offset
> + * is not cached or the size is too big
> + */
> +#define bpf_tld_lookup(tld, key, size) __bpf_tld_lookup(tld, (tld)->off_map->key, size)
> +__attribute__((unused))
> +static void *__bpf_tld_lookup(struct bpf_task_local_data *tld, short cached_off, int size)
I'd probably use some opaque type for these offsets to discourage user
application to try to create them, instead of strictly using offsets
returned from API. So:
typedef struct { int off; } tld_off_t;
and then just use tld_off_t everywhere?
> +{
> + short page_off, page_idx;
> +
> + if (!cached_off--)
> + return NULL;
> +
> + page_off = cached_off & ~PAGE_IDX_MASK;
> + page_idx = !!(cached_off & PAGE_IDX_MASK);
> +
> + if (page_idx) {
> + return (tld->data_map->udata[1].page && page_off < PAGE_SIZE - size) ?
> + (void *)tld->data_map->udata[1].page + page_off : NULL;
> + } else {
> + return (tld->data_map->udata[0].page && page_off < PAGE_SIZE - size) ?
> + (void *)tld->data_map->udata[0].page + page_off : NULL;
> + }
> +}
[...]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-01 20:37 ` [PATCH RFC v3 0/2] Task local data API Andrii Nakryiko
@ 2025-05-01 23:26 ` Alexei Starovoitov
2025-05-02 2:22 ` Andrii Nakryiko
2025-05-02 4:26 ` Amery Hung
1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-05-01 23:26 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Amery Hung, bpf, Network Development, Andrii Nakryiko,
Daniel Borkmann, Tejun Heo, Martin KaFai Lau, Kernel Team
On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Lack of support for shared libraries is a big limitation, IMO, so I
> think we should design for that support from the very beginning.
>
> FWIW, I think this compile-time static __thread variable definitions
> are unnecessarily limiting and add a non-trivial amount of complexity.
> I think it's more reasonable to go with a more dynamic approach, and
> I'll try to outline what an API (and some implementation details)
> might look like.
>
> Note, all this is related to the user-space side of things. BPF-side
> is unchanged, effectively, except you get a guarantee that all the
> data will definitely be page-aligned, so you won't need to do these
> two uptr pages handling.
>
> First, data structures. You'll have one per-process metadata structure
> describing known keys and their assigned "offsets":
>
> struct tld_metadata {
> int cnt;
> char keys[MAX_KEY_CNT];
> __u16 offs[MAX_KEY_CNT];
> __u16 szs[MAX_KEY_CNT];
> };
In Amery's proposal it's
u16 key_cnt;
struct {
char key[TASK_LOCAL_DATA_KEY_LEN];
u16 off;
} keys[];
If we start allocation from the end of the page then we
don't need 'szs' array.
> typedef struct { int off; } tld_off_t;
This is a good tweak.
> tld_off_t tld_add_key_def(const char *key_name, size_t key_size);
Dropping bpf_ prefix and sticking to tld_ is a good move too.
> This API can be called just once per each key that process cares
> about. And this can be done at any point, really, very dynamically.
> The implementation will:
> - (just once per process) open pinned BPF map, remember its FD;
> - (just once) allocate struct tld_metadata, unless we define it as
> pre-allocated global variable;
The main advantage of the new scheme is support for shared libraries.
That's a big plus, but the requirement that everything (shared libs,
static libs, application itself) has to go through this library
(it would have to be shared ?) is quite inconvenient imo.
Let's tweak this proposal and make the kernel the source of truth.
Then shared libs, static and application would only need to agree
on the protocol of accessing bpf task local storage instead of
being forced to use this "tld" library because it's stateful.
"tld" library will be there, but it will be stateless.
We may need to tweak the kernel uptr api a bit, but at a high level
let all user space things assume that two __uptr-s in that special
task local storage map are the source of truth.
The first thing the user space (.so, .a or a.out) will do is
map_lookup_elem(tls_map_fd), and if the __uptr-s are pointing somewhere,
use that memory regardless of which part of this process allocated it
(assuming that .so, .a or a.out won't be freeing it until exit).
If __uptr udata is still NULL, allocate page-aligned page and
map_update_elem(). The kernel doesn't need to deal with races
here since it's a task local value.
If __uptr umetadata is there after lookup, use that to find
and add keys. Here we need some generic locking mechanism,
since umetadata is per-process. Like test-and-set spinlock
that sits within umetadata region and all .so-s, .a-s, a.out
have the same algorithm to lock and access it.
If not there, allocate a page of umetadata and map_update_elem().
Here the kernel would need to deal with races, but I hope
BPF_NOEXIST should already work correctly? I haven't checked.
Worst case we'd need to add support for BPF_F_LOCK (if not already there).
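The test-and-set spinlock described above could be sketched with C11 atomics as follows, assuming the lock word lives at the start of the shared umetadata page (struct and function names here are illustrative, not part of the proposal):

```c
#include <sched.h>
#include <stdatomic.h>

/* the lock word lives inside the shared umetadata page, so every .so,
 * .a and a.out in the process synchronizes on the same memory */
struct umetadata_hdr {
	atomic_flag lock;
	/* ... key table follows here ... */
};

static void umeta_lock(struct umetadata_hdr *h)
{
	while (atomic_flag_test_and_set_explicit(&h->lock,
						 memory_order_acquire))
		sched_yield();	/* spin politely until the holder releases */
}

static void umeta_unlock(struct umetadata_hdr *h)
{
	atomic_flag_clear_explicit(&h->lock, memory_order_release);
}
```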
> - (locklessly) check if key_name is already in tld_metadata, if yes
> - return already assigned offset;
> - (locklessly) if not, add this key and assign it offset that is
> offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
> values (though we should take care of alignment requirements, of
> course);
I'm not quite sure how different processes can do it locklessly.
But if we allocate from the top of the page and there is only one
'offset' field then we can do it lockless. Like:
allocate(size_t sz)
{
size_t old = atomic_read(&obj->offset);
if ((ssize_t)(old - sz) >= 0 &&
atomic_cmpxchg(&obj->offset, old, old - sz) == old)
return ..; // success
}
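A runnable rendering of this top-down lockless allocation with C11 atomics, with a retry loop added around the cmpxchg (the name tld_free_off is illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* single shared word: the lowest free offset; values grow down from
 * the end of the page, so no size table is needed */
static _Atomic size_t tld_free_off = PAGE_SIZE;

/* carve sz bytes from the top of the page; returns the offset of the
 * new region, or -1 if the page is exhausted */
static long tld_alloc(size_t sz)
{
	size_t old, next;

	do {
		old = atomic_load(&tld_free_off);
		if (old < sz)
			return -1;
		next = old - sz;
	} while (!atomic_compare_exchange_weak(&tld_free_off, &old, next));

	return (long)next;
}
```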
No suggestions for the rest of the proposal.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-01 23:26 ` Alexei Starovoitov
@ 2025-05-02 2:22 ` Andrii Nakryiko
2025-05-02 17:55 ` Alexei Starovoitov
0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2025-05-02 2:22 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Amery Hung, bpf, Network Development, Andrii Nakryiko,
Daniel Borkmann, Tejun Heo, Martin KaFai Lau, Kernel Team
On Thu, May 1, 2025 at 4:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > Lack of support for shared libraries is a big limitation, IMO, so I
> > think we should design for that support from the very beginning.
> >
> > FWIW, I think this compile-time static __thread variable definitions
> > are unnecessarily limiting and add a non-trivial amount of complexity.
> > I think it's more reasonable to go with a more dynamic approach, and
> > I'll try to outline what an API (and some implementation details)
> > might look like.
> >
> > Note, all this is related to the user-space side of things. BPF-side
> > is unchanged, effectively, except you get a guarantee that all the
> > data will definitely be page-aligned, so you won't need to do these
> > two uptr pages handling.
> >
> > First, data structures. You'll have one per-process metadata structure
> > describing known keys and their assigned "offsets":
> >
> > struct tld_metadata {
> > int cnt;
> > char keys[MAX_KEY_CNT];
> > __u16 offs[MAX_KEY_CNT];
> > __u16 szs[MAX_KEY_CNT];
> > };
>
> In Amery's proposal it's
> u16 key_cnt;
> struct {
> char key[TASK_LOCAL_DATA_KEY_LEN];
> u16 off;
> } keys[];
>
> If we start allocation from the end of the page then we
> don't need 'szs' array.
I wasn't trying to optimize those few bytes taken by szs, tbh.
Allocating from the end of the page bakes in the assumption that we
won't ever need more than one page. I don't know if I'd do that. But
we can just track "next available offset" instead, so it doesn't
really matter much.
>
> > typedef struct { int off; } tld_off_t;
>
> This is a good tweak.
>
> > tld_off_t tld_add_key_def(const char *key_name, size_t key_size);
>
> Dropping bpf_ prefix and sticking to tld_ is a good move too.
>
> > This API can be called just once per each key that process cares
> > about. And this can be done at any point, really, very dynamically.
> > The implementation will:
> > - (just once per process) open pinned BPF map, remember its FD;
> > - (just once) allocate struct tld_metadata, unless we define it as
> > pre-allocated global variable;
>
> The main advantage of the new scheme is support for shared libraries.
> That's a big plus, but the requirement that everything (shared libs,
> static libs, application itself) has to go through this library
> (it would have to be shared ?) is quite inconvenient imo.
It's simple enough to be contained within a single .h file. Everything
else is just __weak symbols for per-process state, so I'm not sure
this is a limitation of any sort.
>
> Let's tweak this proposal and make the kernel the source of truth.
> Then shared libs, static and application would only need to agree
> on the protocol of accessing bpf task local storage instead of
> being forced to use this "tld" library because it's stateful.
> "tld" library will be there, but it will be stateless.
Yes, data structures and the protocol of accessing and synchronizing
the updates is what matters.
>
> We may need to tweak the kernel uptr api a bit, but at a high level
> let all user space things assume that two __uptr-s in that special
> task local storage map are the source of truth.
> First, the user space (.so, .a or a.out) will do is
> map_lookup_elem(tls_map_fd) and if __uptr are pointing somewhere
> use that memory regardless of which part of this process allocated it
> (assuming that .so, .a or a.out won't be freeing it until exit).
> If __uptr udata is still NULL, allocate page-aligned page and
> map_update_elem(). The kernel doesn't need to deal with races
> here since it's a task local value.
Sure. Keep in mind, though, that in practice, whatever different
libraries constitute your application, they all will need to agree on
how the BPF task local storage map FD is obtained.
> If __uptr umetadata is there after lookup, use that to find
> and add keys. Here we need some generic locking mechanism,
> since umetadata is per-process. Like test-and-set spinlock
> that sits within umetadata region and all .so-s, .a-s, a.out
> have the same algorithm to lock and access it.
Yep, see below.
>
> If not there, allocate a page of umetadata and map_update_elem().
> Here the kernel would need to deal with races, but I hope
> BPF_NOEXIST should already work correctly? I haven't checked.
> Worst case we'd need to add support for BPF_F_LOCK (if not already there).
yeah, BPF_NOEXIST should work, I wouldn't do BPF_F_LOCK.
>
> > - (locklessly) check if key_name is already in tld_metadata, if yes
> > - return already assigned offset;
> > - (locklessly) if not, add this key and assign it offset that is
> > offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
> > values (though we should take care of alignment requirements, of
> > course);
>
> I'm not quite sure how different processes can do it locklessly.
There are no different processes, it's all one process, many
threads... Or is that what you meant? tld_metadata is *per process*,
tld_data is *per thread*. Processes don't need to coordinate anything
between themselves, only threads within the process.
As for how I'd do offset allocation and key addition locklessly. You
are right that it can't be done completely locklessly, but just
looping and yielding probably would be fine.
Then the sequence of adding the key would be something like below.
I've modified tld_metadata a bit to make this simpler and more
economical (and I fixed definition of keys array of array of chars,
oops):
struct tld_metadata {
int cnt;
int next_off;
char keys[MAX_KEY_CNT][MAX_KEY_LEN];
__u16 offs[MAX_KEY_CNT];
};
struct tld_metadata *m = ...;
const char *new_key = ...;
int i = 0;
/* all m->offs[i] are set to -1 on creation */
again:
	int key_cnt = m->cnt;
	for (; i < key_cnt; i++) {
		while (m->offs[i] < 0) /* update in progress */
			sched_yield();
		if (strcmp(m->keys[i], new_key) == 0)
			return m->offs[i];
	}
	if (cmpxchg(&m->cnt, key_cnt, key_cnt + 1) != key_cnt)
		goto again; /* we raced, key might have been added
			       already, recheck, but keep i */
	/* slot key_cnt is ours, we need to calculate and assign offset */
	int new_off = m->next_off;
	m->next_off = new_off + key_sz;
	m->keys[key_cnt][0] = '\0';
	strncat(m->keys[key_cnt], new_key, MAX_KEY_LEN - 1);
	/* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */
	m->offs[key_cnt] = new_off; /* this finalizes the key -> offset
				       assignment */
	/* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */
	return new_off; /* we are done */
Something like that. There is that looping and yield to not miss
someone else winning the race and adding a key, so that's the locking
part. But given that adding a key definition is supposed to be one
time operation (per key), I don't think we should be fancy with
locking.
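A compilable rendering of the sequence sketched above, using C11 atomics. Function names here are illustrative; next_off is made atomic with fetch_add so that two threads claiming different slots don't race on it (one of the spots the barrier comments flag), and alignment handling is still elided:

```c
#include <sched.h>
#include <stdatomic.h>
#include <string.h>

#define MAX_KEY_CNT 64
#define MAX_KEY_LEN 16

struct tld_metadata {
	_Atomic int cnt;
	_Atomic int next_off;
	char keys[MAX_KEY_CNT][MAX_KEY_LEN];
	_Atomic short offs[MAX_KEY_CNT];
};

/* all offs[] must start out as -1 ("slot claimed but not published") */
static void tld_meta_init(struct tld_metadata *m)
{
	for (int i = 0; i < MAX_KEY_CNT; i++)
		atomic_store(&m->offs[i], -1);
}

static int tld_key_add(struct tld_metadata *m, const char *new_key, int key_sz)
{
	int i = 0, key_cnt, new_off;

again:
	key_cnt = atomic_load(&m->cnt);
	for (; i < key_cnt; i++) {
		/* slot claimed but offset not yet published: wait it out */
		while (atomic_load(&m->offs[i]) < 0)
			sched_yield();
		if (strcmp(m->keys[i], new_key) == 0)
			return atomic_load(&m->offs[i]);
	}

	if (key_cnt >= MAX_KEY_CNT)
		return -1;

	/* try to claim slot key_cnt; if we raced, the key might have been
	 * added already, so recheck the new entries (keeping i) */
	if (!atomic_compare_exchange_strong(&m->cnt, &key_cnt, key_cnt + 1))
		goto again;

	/* the slot is ours; fetch_add keeps concurrent claimers of other
	 * slots from racing on next_off */
	new_off = atomic_fetch_add(&m->next_off, key_sz);
	m->keys[key_cnt][0] = '\0';
	strncat(m->keys[key_cnt], new_key, MAX_KEY_LEN - 1);
	/* release store publishes keys[] before the offset becomes visible */
	atomic_store_explicit(&m->offs[key_cnt], (short)new_off,
			      memory_order_release);
	return new_off;
}
```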
> But if we allocate from the top of the page and there is only one
> 'offset' field then we can do it lockless. Like:
> allocate(size sz)
> {
> size_t old = atomic_read(&obj->offset);
>
> if ((ssize_t)(old - sz) >= 0 &&
> atomic_cmpxchg(&obj->offset, old, old - sz) == old)
> return ..; // success
does it really matter whether it's from the top or bottom of the page?
> }
>
> No suggestions for the rest of the proposal.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-01 20:37 ` [PATCH RFC v3 0/2] Task local data API Andrii Nakryiko
2025-05-01 23:26 ` Alexei Starovoitov
@ 2025-05-02 4:26 ` Amery Hung
2025-05-02 16:14 ` Andrii Nakryiko
1 sibling, 1 reply; 18+ messages in thread
From: Amery Hung @ 2025-05-02 4:26 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
kernel-team
On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> > Hi,
> >
> > This a respin of uptr KV store. It is renamed to task local data (TLD)
> > as the problem statement and the solution have changed, and it now draws
> > more similarities to pthread thread-specific data.
> >
> > * Overview *
> >
> > This patchset is a continuation of the original UPTR work[0], which aims
> > to provide a fast way for user space programs to pass per-task hints to
> > sched_ext schedulers. UPTR built the foundation by supporting sharing
> > user pages with bpf programs through task local storage maps.
> >
> > Additionally, sched_ext would like to allow multiple developers to share
> > a storage without the need to explicitly agreeing on the layout of it.
> > This simplify code base management and makes experimenting easier.
> > While a centralized storage layout definition would have worked, the
> > friction of synchronizing it across different repos is not desirable.
> >
> > This patchset contains the user space plumbing so that user space and bpf
> > program developers can exchange per-task hints easily through simple
> > interfaces.
> >
> > * Design *
> >
> > BPF task local data is a simple API for sharing task-specific data
> > between user space and bpf programs, where data are refered to using
> > string keys. As shown in the following figure, user space programs can
> > define a task local data using bpf_tld_type_var(). The data is
> > effectively a variable declared with __thread, which every thread owns an
> > independent copy and can be directly accessed. On the bpf side, a task
> > local data first needs to be initialized for every new task once (e.g.,
> > in sched_ext_ops::init_task) using bpf_tld_init_var(). Then, other bpf
> > programs can get a pointer to the data using bpf_tld_lookup(). The task
> > local data APIs refer to data using string keys so developers
> > does not need to deal with addresses of data in a shared storage.
> >
> > ┌─ Application ─────────────────────────────────────────┐
> > │ ┌─ library A ──────────────┐ │
> > │ bpf_tld_type_var(int, X) │ bpf_tld_type_var(int, Y) │ │
> > │ └┬─────────────────────────┘ │
> > └───────┬───────────────────│───────────────────────────┘
> > │ X = 123; │ Y = true;
>
> bpf_tld_type_var() is *defining* variable (i.e., it allocates storage
> for it), right? I think for completeness we need also *declare* macro
> to access that variable from other compilation units
>
>
> > V V
> > + ─ Task local data ─ ─ ─ ─ ─ ─ +
> > | ┌─ task_kvs_map ────────────┐ | ┌─ sched_ext_ops::init_task ──────┐
> > | │ BPF Task local storage │ | │ bpf_tld_init_var(&kvs, X); │
> > | │ ┌───────────────────┐ │ |<─┤ bpf_tld_init_var(&kvs, Y); │
> > | │ │ __uptr *udata │ │ | └─────────────────────────────────┘
> > | │ └───────────────────┘ │ |
> > | │ ┌───────────────────┐ │ | ┌─ Other sched_ext_ops op ────────┐
> > | │ │ __uptr *umetadata │ │ | │ int *y; ├┐
> > | │ └───────────────────┘ │ |<─┤ y = bpf_tld_lookup(&kvs, Y, 1); ││
> > | └───────────────────────────┘ | │ if (y) ││
> > | ┌─ task_kvs_off_map ────────┐ | │ /* do something */ ││
> > | │ BPF Task local storage │ | └┬────────────────────────────────┘│
> > | └───────────────────────────┘ | └─────────────────────────────────┘
> > + ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ +
> >
> > * Implementation *
> >
> > Task local data API hides the memory management from the developers.
> > Internally, it shares user data with bpf programs through udata UPTRs.
> > Task local data from different compilation units are placed into a
> > custom "udata" section by the declaration API, bpf_tld_type_var(), so
> > that they are placed together in memory. User space will need to
> > call bpf_tld_thread_init() for every new thread to pin udata pages to
> > the kernel.
> >
> > The metadata used to address udata is stored in umetadata UPTR. It is
> > generated by constructors inserted by bpf_tld_type_var() and
> > bpf_tld_thread_init(). umetadata is an array of 64 metadata entries, one
> > per data item, each containing the key and the offset of the data in udata.
> > During initialization, bpf_tld_init_var() will search umetadata for
> > a matching key and cache its offset in task_kvs_off_map. Later,
> > bpf_tld_lookup() will use the cached offset to retrieve a pointer to
> > udata.
> >
> > * Limitation *
> >
> > Currently, it is assumed all key-value pairs are known as a program
> > starts. All compilation units using task local data should be statically
> > linked together so that values are all placed together in a udata section
> > and therefore can be shared with bpf through two UPTRs. The next
>
> Lack of support for shared libraries is a big limitation, IMO, so I
> think we should design for that support from the very beginning.
>
> FWIW, I think this compile-time static __thread variable definitions
> are unnecessarily limiting and add a non-trivial amount of complexity.
> I think it's more reasonable to go with a more dynamic approach, and
> I'll try to outline what an API (and some implementation details)
> might look like.
>
Indeed, the limitation of the static approach plus the restriction of
uptr makes the implementation complicated. I will switch to the
dynamic approach in the next respin.
> Note, all this is related to the user-space side of things. BPF-side
> is unchanged, effectively, except you get a guarantee that all the
> data will definitely be page-aligned, so you won't need to do these
> two uptr pages handling.
>
> First, data structures. You'll have one per-process metadata structure
> describing known keys and their assigned "offsets":
>
> struct tld_metadata {
> int cnt;
> char keys[MAX_KEY_CNT];
> __u16 offs[MAX_KEY_CNT];
> __u16 szs[MAX_KEY_CNT];
> };
>
> Now, each thread will basically have just a blob of data, so,
> technically, per-thread we will just have:
>
> struct tld_data {
> __u8 data[PAGE_SIZE];
> };
>
> By pre-allocating the entire page we avoid lots of complications, so I
> think it's worth doing.
>
> Now, we really need just two APIs on user-space (and I'll use the
> "opaque offset" type as I suggested on another patch):
>
> typedef struct { int off; } tld_off_t;
>
> tld_off_t tld_add_key_def(const char *key_name, size_t key_size);
I will incorporate it into the next iteration.
>
> This API can be called just once per each key that process cares
> about. And this can be done at any point, really, very dynamically.
> The implementation will:
> - (just once per process) open pinned BPF map, remember its FD;
> - (just once) allocate struct tld_metadata, unless we define it as
> pre-allocated global variable;
> - (locklessly) check if key_name is already in tld_metadata, if yes
> - return already assigned offset;
> - (locklessly) if not, add this key and assign it offset that is
> offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
> values (though we should take care of alignment requirements, of
> course);
> - return newly assigned offset;
>
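A minimal user-space sketch of what tld_add_key_def() could look like (my own
illustration with assumed names; the race between two threads concurrently
registering the same new key, and alignment handling, are deliberately
simplified, as noted in the comments):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEY_CNT 64
#define TLD_KEY_LEN 62

typedef struct { int off; } tld_off_t;

static struct tld_metadata {
	_Atomic int cnt;
	char keys[MAX_KEY_CNT][TLD_KEY_LEN];
	unsigned short offs[MAX_KEY_CNT];
	unsigned short szs[MAX_KEY_CNT];
} tld_metadata;

/*
 * Register a key once per process and return its assigned offset.
 * Slots are claimed by atomically bumping cnt; values are tightly
 * packed after the previous slot. Alignment handling, and the race
 * where two threads concurrently register the same new key (a real
 * version needs a per-slot publish flag or a CAS on the key), are
 * omitted for brevity.
 */
tld_off_t tld_add_key_def(const char *key_name, size_t key_size)
{
	int i, cnt;

	/* fast path: key already registered */
	cnt = atomic_load(&tld_metadata.cnt);
	for (i = 0; i < cnt; i++)
		if (!strncmp(tld_metadata.keys[i], key_name, TLD_KEY_LEN))
			return (tld_off_t){ tld_metadata.offs[i] };

	/* claim a new slot */
	i = atomic_fetch_add(&tld_metadata.cnt, 1);
	if (i >= MAX_KEY_CNT)
		return (tld_off_t){ -1 };

	tld_metadata.offs[i] = i ?
		tld_metadata.offs[i - 1] + tld_metadata.szs[i - 1] : 0;
	tld_metadata.szs[i] = key_size;
	strncpy(tld_metadata.keys[i], key_name, TLD_KEY_LEN - 1);
	return (tld_off_t){ tld_metadata.offs[i] };
}
```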
> Now, the second essential API is called for each participating thread
> for each different key. And again, this is all very dynamic. It's
> possible that some threads won't use any of this TLD stuff, in which
> case there will be no overhead (memory or CPU), and not even an entry
> in task local storage map for that thread. So, API:
>
The advantage of wasting no memory on threads that are not using TLD
doesn't seem that clear-cut to me. If users add per-process
hints, this scheme can potentially use a lot more memory (i.e.,
PAGE_SIZE * number of threads). Maybe we need another uptr for
per-process data? Or do you think this is out of the scope of TLD and
we should recommend other solutions?
> void *tld_resolve_key(tld_off_t key_off);
>
> This API will:
> - (just once per thread, which is done trivially by using
> __thread-local global variable to keep track of this) allocate struct
> tld_data dynamically (with page alignment, alloc_aligned(PAGE_SIZE,
> PAGE_SIZE))
> - (just once per thread as well) bpf_map_update_elem() for current
> thread, updating two uptrs: one pointing to global tld_metadata,
> another pointing to thread-local tld_data;
> - return tld_data->data + key_off.off;
>
> That is, this API returns an absolute memory address of a value
> resolved in the context of the current thread.
>
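A sketch of how tld_resolve_key() could be implemented (my illustration; the
bpf_map_update_elem() plumbing that publishes the two uptrs to the task local
storage map is elided and only noted in a comment):

```c
#include <stdlib.h>
#include <string.h>

#define TLD_PAGE_SIZE 4096

typedef struct { int off; } tld_off_t;

struct tld_data {
	unsigned char data[TLD_PAGE_SIZE];
};

static __thread struct tld_data *tld_data;

/*
 * Hypothetical sketch of the per-thread resolve step. On first use in
 * a thread it allocates the page-aligned data blob; a real version
 * would also bpf_map_update_elem() the task local storage map here to
 * point the two uptrs (tld_metadata, tld_data) at the right pages.
 */
void *tld_resolve_key(tld_off_t key_off)
{
	if (!tld_data) {
		/* once per thread: page-aligned, so a uptr to it can
		 * never cross a page boundary */
		tld_data = aligned_alloc(TLD_PAGE_SIZE, sizeof(*tld_data));
		if (!tld_data)
			return NULL;
		memset(tld_data, 0, sizeof(*tld_data));
		/* bpf_map_update_elem(map_fd, &pidfd, &val, 0) would go here */
	}
	return tld_data->data + key_off.off;
}
```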
>
> And let's look at how one can make use of this on the user-space side
> to optimally use this API.
>
>
> /* global variables */
> tld_off_t my_key1_off; /* struct my_val */
> tld_off_t my_key2_off; /* int */
>
> __thread struct my_val *my_val1;
> __thread int *my_val2;
>
> ... somewhere in constructor ...
>
> my_key1_off = tld_add_key_def("my_key1", sizeof(struct my_val));
> my_key2_off = tld_add_key_def("my_key2", sizeof(int));
>
> ... and then somewhere in the code that makes use of TLD stuff ...
>
> if (!my_val1) /* this can be initialized once per thread to avoid this
> check (or hidden in a helper accessor function) */
> my_val1 = tld_resolve(my_key1_off);
>
> my_val1->blah_field = 123;
>
> if (!my_val2)
> my_val2 = tld_resolve(my_key2_off);
> *my_val2 = 567;
>
>
> That's pretty much it, I think.
Thanks for elaborating what the alternative approach would look like.
>
> In C++ code base, it should be possible to make this even more
> convenient by using a templated wrapper with thread-local inner
> variable with its own constructor. Adding operator overloading (e.g.,
> operator= and operator->) you get a very naturally looking definition
> and access patterns:
>
> /* global data */
>
> tld_variable<struct my_val> my_val1("my_key1");
> tld_variable<int> my_val2("my_key2");
>
> ... now in the actual TLD-using code, we just do:
>
> my_val1->blah_field = 123;
> my_val2 = 567; /* I think C++ would allow this with operator overloading */
>
> I hope the example explains why it's still fast despite everything
> being dynamic. There is a pointer indirection and that page-sized
> allocation (so that we can cache thread-specific resolved pointers),
> but it seems absolutely fine from performance perspective.
>
> > iteration will explore how bpf task local data can work in dynamic
> > libraries. Maybe more udata UPTRs will be added to pin the TLS pages
> > of dynamically loaded modules. Or maybe it will allocate memory for data
> > instead of relying on __thread, and change how user space interacts with
> > task local data slightly. The latter approach can also save some trouble
> > dealing with the restrictions of UPTR.
> >
> > Some other limitations:
> > - Total task local data cannot exceed a page
> > - Only support 64 task local data
> > - Some memory waste for data whose size is not a power of two
> > due to UPTR limitation
> >
> > [0] https://lore.kernel.org/bpf/20241023234759.860539-1-martin.lau@linux.dev/
> >
> >
> > Amery Hung (2):
> > selftests/bpf: Introduce task local data
> > selftests/bpf: Test basic workflow of task local data
> >
> > .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
> > .../bpf/prog_tests/task_local_data.h | 58 ++++++
> > .../bpf/prog_tests/test_task_local_data.c | 156 +++++++++++++++
> > .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
> > .../bpf/progs/test_task_local_data_basic.c | 78 ++++++++
> > .../selftests/bpf/task_local_data_common.h | 49 +++++
> > 6 files changed, 681 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_task_local_data.c
> > create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> > create mode 100644 tools/testing/selftests/bpf/progs/test_task_local_data_basic.c
> > create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
> >
> > --
> > 2.47.1
> >
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data
2025-04-30 1:44 ` Alexei Starovoitov
@ 2025-05-02 15:00 ` Amery Hung
0 siblings, 0 replies; 18+ messages in thread
From: Amery Hung @ 2025-05-02 15:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Network Development, Andrii Nakryiko, Daniel Borkmann,
Tejun Heo, Martin KaFai Lau, Kernel Team
On Tue, Apr 29, 2025 at 6:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> > Task local data provides simple and fast bpf and user space APIs to
> > exchange per-task data through task local storage map. The user space
> > first declares task local data using bpf_tld_type_key_var() or
> > bpf_tld_type_var(). The data is a thread-specific variable of which
> > every thread has its own copy. Then, bpf_tld_thread_init() needs to
> > be called for every thread to share the data with bpf. Finally, users
> > can directly read/write thread local data without bpf syscall.
> >
> > For bpf programs to see task local data, every piece of data needs to be
> > initialized once for every new task using bpf_tld_init_var(). Then
> > bpf programs can retrieve pointers to the data with bpf_tld_lookup().
> >
> > The core of task local storage implementation relies on UPTRs. They
> > pin user pages to the kernel so that user space can share data with bpf
> > programs without syscalls. Both data and the metadata used to access
> > data are pinned via UPTRs.
> >
> > A limitation of UPTR makes the implementation of task local data
> > less trivial than it sounds: memory pinned to UPTR cannot exceed a
> > page and must not cross the page boundary. In addition, the data
> > declaration uses the __thread identifier and therefore does not have
> > direct control over the memory allocation. Therefore, several
> > tricks and checks are used to make it work.
> >
> > First, task local data declaration APIs place data in a custom "udata"
> > section so that data from different compilation units will be contiguous
> > in the memory and can be pinned using two UPTRs if they are smaller than
> > one page.
> >
> > To avoid each piece of data spanning across two pages, they are each
> > aligned to the smallest power of two no smaller than their sizes.
> >
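Why the power-of-two alignment is sufficient can be checked directly: if align
is a power of two with align >= size, any align-multiple offset keeps
[off, off + size) within one page. A small self-contained check (my own
illustration, mirroring the patch's ROUND_UP_POWER_OF_TWO macro and its
page-crossing test):

```c
#include <stddef.h>

#define PAGE_SIZE 4096
#define PAGE_MASK (~(PAGE_SIZE - 1UL))

/* smallest power of two >= x (for x >= 2), as in the patch's
 * ROUND_UP_POWER_OF_TWO */
static unsigned long round_up_pow2(unsigned long x)
{
	return 1UL << (sizeof(x) * 8 - __builtin_clzl(x - 1));
}

/* does [off, off + size) cross a page boundary? (same test the patch
 * performs in bpf_tld_thread_init()) */
static int crosses_page(unsigned long off, unsigned long size)
{
	return (off & PAGE_MASK) != ((off + size - 1) & PAGE_MASK);
}
```

Since align divides PAGE_SIZE, off % PAGE_SIZE is a multiple of align and is
at most PAGE_SIZE - align, so the last byte stays within the page.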
> > As we don't control the memory allocation for data, we need to figure
> > out the layout of user defined data. This is done by the data
> > declaration API and bpf_tld_thread_init(). The data declaration API
> > will insert constructors for all data, and they are used to find the
> > size and offset of data as well as the beginning and end of the whole
> > udata section. Then, bpf_tld_thread_init() performs a per-thread check
> > to make sure no data will cross the page boundary, as udata can start at
> > a different offset within a page in each thread.
> >
> > Note that for umetadata, we directly aligned_alloc() memory for it and
> > assign it to the UPTR. This is only done once for every process as every
> > task shares the same umetadata. The actual thread-specific data offset
> > will be adjusted in the bpf program when calling bpf_tld_init_var().
> >
> > Signed-off-by: Amery Hung <ameryhung@gmail.com>
> > ---
> > .../bpf/prog_tests/task_local_data.c | 159 +++++++++++++++
> > .../bpf/prog_tests/task_local_data.h | 58 ++++++
> > .../selftests/bpf/progs/task_local_data.h | 181 ++++++++++++++++++
> > .../selftests/bpf/task_local_data_common.h | 41 ++++
> > 4 files changed, 439 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.c
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/task_local_data.h
> > create mode 100644 tools/testing/selftests/bpf/progs/task_local_data.h
> > create mode 100644 tools/testing/selftests/bpf/task_local_data_common.h
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.c b/tools/testing/selftests/bpf/prog_tests/task_local_data.c
> > new file mode 100644
> > index 000000000000..5a21514573d2
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.c
> > @@ -0,0 +1,159 @@
> > +#include <fcntl.h>
> > +#include <errno.h>
> > +#include <stdio.h>
> > +#include <pthread.h>
> > +
> > +#include <bpf/bpf.h>
> > +
> > +#include "bpf_util.h"
> > +#include "task_local_data.h"
> > +#include "task_local_storage_helpers.h"
> > +
> > +#define PIDFD_THREAD O_EXCL
> > +
> > +/* To find the start of udata for each thread, insert a dummy variable to udata.
>
> Pls use kernel comment style instead of networking.
>
Thanks for reviewing! Since the next respin will pivot to a dynamic
approach, I will address these issues where they are still applicable.
I will change to kernel comment style.
> > + * Contructors generated for every task local data will figured out the offset
>
> constructors.
>
Not the only typo I made today... I will fix my codespell setup.
> > + * from the beginning of udata to the dummy symbol. Then, every thread can infer
> > + * the start of udata by subtracting the offset from the address of dummy.
> > + */
>
> Pls add the full algorithm involving udata_dummy here.
> How it can be anywhere in the udata section...
> then constructors will keep adjusting udata_start_dummy_off
> and eventually km->off -= udata_start_dummy_off.
> These steps are very tricky.
> So big and detailed comments are necessary.
>
> > +static __thread struct udata_dummy {} udata_dummy SEC("udata");
> > +
> > +static __thread bool task_local_data_thread_inited;
> > +
> > +struct task_local_data {
> > + void *udata_start;
> > + void *udata_end;
> > + int udata_start_dummy_off;
> > + struct meta_page *umetadata;
> > + int umetadata_cnt;
> > + bool umetadata_init;
> > + short udata_sizes[64];
> > + pthread_mutex_t lock;
> > +} task_local_data = {
> > + .udata_start = (void *)-1UL,
> > + .lock = PTHREAD_MUTEX_INITIALIZER,
> > +};
> > +
> > +static void tld_set_data_key_meta(int i, const char *key, short off)
> > +{
> > + task_local_data.umetadata->meta[i].off = off;
> > + strncpy(task_local_data.umetadata->meta[i].key, key, TASK_LOCAL_DATA_KEY_LEN);
> > +}
> > +
> > +static struct key_meta *tld_get_data_key_meta(int i)
> > +{
> > + return &task_local_data.umetadata->meta[i];
> > +}
> > +
> > +static void tld_set_data_size(int i, int size)
> > +{
> > + task_local_data.udata_sizes[i] = size;
> > +}
> > +
> > +static int tld_get_data_size(int i)
> > +{
> > + return task_local_data.udata_sizes[i];
> > +}
>
> The above 4 helpers are single use.
> If nothing else is using them, open code them directly.
> Otherwise it only makes it harder to understand the logic.
>
> > +
> > +void __bpf_tld_var_init(const char *key, void *var, int size)
> > +{
> > + int i;
> > +
> > + i = task_local_data.umetadata_cnt++;
> > +
> > + if (!task_local_data.umetadata) {
> > + if (task_local_data.umetadata_cnt > 1)
> > + return;
> > +
> > + task_local_data.umetadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
> > + if (!task_local_data.umetadata)
> > + return;
> > + }
> > +
> > + if (var < task_local_data.udata_start) {
> > + task_local_data.udata_start = var;
> > + task_local_data.udata_start_dummy_off =
> > + (void *)&udata_dummy - task_local_data.udata_start;
> > + }
> > +
> > + if (var + size > task_local_data.udata_end)
> > + task_local_data.udata_end = var + size;
> > +
> > + tld_set_data_key_meta(i, key, var - (void *)&udata_dummy);
> > + tld_set_data_size(i, size);
> > +}
> > +
> > +int bpf_tld_thread_init(void)
> > +{
> > + unsigned long udata_size, udata_start, udata_start_page, udata_end_page;
> > + struct task_local_data_map_value map_val;
> > + int i, task_id, task_fd, map_fd, err;
> > +
> > + if (!task_local_data.umetadata_cnt || task_local_data_thread_inited)
> > + return 0;
> > +
> > + if (task_local_data.umetadata_cnt && !task_local_data.umetadata)
> > + return -ENOMEM;
> > +
> > + udata_start = (unsigned long)&udata_dummy + task_local_data.udata_start_dummy_off;
> > +
> > + pthread_mutex_lock(&task_local_data.lock);
>
> can we drop this?
>
> If .c is part of .h just this line will drag -lpthread dependency.
> I think it's an artifact on the selftest.
> The selftest/library/application can have its own mutex to protect
> or this function can use simple xchg() like exclusion
> without mutexes.
I will drop the pthread dependency. The mutex also makes the user space
tld library stateful, which is against the principle of making the
task local storage map the ground truth.
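For reference, a minimal sketch of the mutex-free one-shot exclusion Alexei
suggests, using C11 atomics (names are mine; a real version might
sched_yield() instead of busy-spinning):

```c
#include <stdatomic.h>

/* 0 = not started, 1 = in progress, 2 = done */
static _Atomic int tld_init_state;

/*
 * Hypothetical one-time init without a mutex: the first caller to win
 * the CAS performs the initialization; concurrent callers spin until
 * the result is published.
 */
static int tld_once(void (*init_fn)(void))
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&tld_init_state, &expected, 1)) {
		init_fn();
		atomic_store(&tld_init_state, 2);
	} else {
		while (atomic_load(&tld_init_state) != 2)
			; /* spin; real code might sched_yield() here */
	}
	return 0;
}
```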
>
> > + for (i = 0; i < task_local_data.umetadata_cnt; i++) {
> > + struct key_meta *km = tld_get_data_key_meta(i);
> > + int size = tld_get_data_size(i);
> > + int off;
> > +
> > + if (!task_local_data.umetadata_init) {
> > + /* Constructors save the offset from udata_dummy to each data
> > + * Now as all ctors have run and the offset between the start of
> > + * udata and udata_dummy is known, adjust the offsets of data
> > + * to be relative to the start of udata
> > + */
> > + km->off -= task_local_data.udata_start_dummy_off;
> > +
> > + /* Data exceeding a page may not be able to be covered by
> > + * two udata UPTRs in every thread
> > + */
> > + if (km->off >= PAGE_SIZE)
> > + return -EOPNOTSUPP;
>
> returns without releasing the mutex...
> One more reason to avoid it.
>
> > + }
> > +
> > + /* A task local data should not span across two pages. */
> > + off = km->off + udata_start;
> > + if ((off & PAGE_MASK) != ((off + size - 1) & PAGE_MASK))
> > + return -EOPNOTSUPP;
> > + }
> > + task_local_data.umetadata_init = true;
> > + pthread_mutex_unlock(&task_local_data.lock);
> > +
> > + udata_size = task_local_data.udata_end - task_local_data.udata_start;
> > + udata_start_page = udata_start & PAGE_MASK;
> > + udata_end_page = (udata_start + udata_size) & PAGE_MASK;
> > +
> > + /* The whole udata can span across two pages for a thread. Use two UPTRs
> > + * to cover the second page in case it happens.
> > + */
> > + map_val.udata_start = udata_start & ~PAGE_MASK;
> > + map_val.udata[0].page = (struct data_page *)(udata_start_page);
> > + map_val.udata[1].page = (udata_start_page == udata_end_page) ? NULL :
> > + (struct data_page *)(udata_start_page + PAGE_SIZE);
> > +
> > + /* umetadata is shared by all threads under the assumption that all
> > + * task local data are defined statically and linked together
> > + */
> > + map_val.umetadata = task_local_data.umetadata;
> > + map_val.umetadata_cnt = task_local_data.umetadata_cnt;
> > +
> > + map_fd = bpf_obj_get(TASK_LOCAL_DATA_MAP_PIN_PATH);
> > + if (map_fd < 0)
> > + return -errno;
> > +
> > + task_id = sys_gettid();
> > + task_fd = sys_pidfd_open(task_id, PIDFD_THREAD);
> > + err = bpf_map_update_elem(map_fd, &task_fd, &map_val, 0);
> > + if (err)
> > + return err;
> > +
> > + task_local_data_thread_inited = true;
> > + return 0;
> > +}
> > diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_data.h b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
> > new file mode 100644
> > index 000000000000..c928e8d2c0a6
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/task_local_data.h
> > @@ -0,0 +1,58 @@
> > +#ifndef __BPF_TASK_LOCAL_DATA_H__
> > +#define __BPF_TASK_LOCAL_DATA_H__
> > +
> > +#include "task_local_data_common.h"
> > +
> > +#define SEC(name) __attribute__((section(name), used))
> > +#define __aligned(x) __attribute__((aligned(x)))
> > +
> > +#define ROUND_UP_POWER_OF_TWO(x) (1UL << (sizeof(x) * 8 - __builtin_clzl(x - 1)))
> > +
> > +void __bpf_tld_var_init(const char *key, void *var, int size);
>
> If possible, let's try to put everything into .h, so it's easier
> to distribute. Otherwise extra .c becomes a headache.
>
I will try to put everything in a .h file in the next respin.
> > +
> > +/**
> > + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
> > + * programs. The data will be a thread-specific variable which a user space
> > + * program can directly read/write, while bpf programs will need to lookup
> > + * with the string key.
> > + *
> > + * @param key The string key a task local data will be associated with. The
> > + * string will be truncated if the length exceeds TASK_LOCAL_DATA_KEY_LEN
> > + * @param type The type of the task local data
> > + * @param var The name of the task local data
> > + */
> > +#define bpf_tld_key_type_var(key, type, var) \
> > +__thread type var SEC("udata") __aligned(ROUND_UP_POWER_OF_TWO(sizeof(type))); \
> > + \
> > +__attribute__((constructor)) \
> > +void __bpf_tld_##var##_init(void) \
> > +{ \
> > + _Static_assert(sizeof(type) < PAGE_SIZE, \
> > + "data size must not exceed a page"); \
> > + __bpf_tld_var_init(key, &var, sizeof(type)); \
> > +}
> > +
> > +/**
> > + * @brief bpf_tld_key_type_var() declares a task local data shared with bpf
> > + * programs. The data will be a thread-specific variable which a user space
> > + * program can directly read/write, while bpf programs will need to lookup
> > + * the data with the string key same as the variable name.
> > + *
> > + * @param type The type of the task local data
> > + * @param var The name of the task local data as well as the name of the
> > + * key. The key string will be truncated if the length exceeds
> > + * TASK_LOCAL_DATA_KEY_LEN.
> > + */
> > +#define bpf_tld_type_var(type, var) \
> > + bpf_tld_key_type_var(#var, type, var)
>
> Hiding string obfuscates it too much.
> This API doesn't have analogous APIs either in bpf or user space.
> So let's make everything explicit.
> In this case bpf_tld_key_type_var()-like should be the only api
> to declare a variable.
> I would call it bpf_tld_var().
>
I will keep only one flavor of declaration API. It will let users
specify the key string without macro obfuscation.
> > +
> > +/**
> > + * @brief bpf_tld_thread_init() initializes the task local data for the current
> > + * thread. All data are undefined from a bpf program's point of view until
> > + * bpf_tld_thread_init() is called.
> > + *
> > + * @return 0 on success; negative error number on failure
> > + */
> > +int bpf_tld_thread_init(void);
> > +
> > +#endif
> > diff --git a/tools/testing/selftests/bpf/progs/task_local_data.h b/tools/testing/selftests/bpf/progs/task_local_data.h
> > new file mode 100644
> > index 000000000000..7358993ee634
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/task_local_data.h
> > @@ -0,0 +1,181 @@
> > +#include <errno.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +#include "task_local_data_common.h"
> > +
> > +#define PAGE_IDX_MASK 0x8000
> > +
> > +/* Overview
> > + *
> > + * Task local data (TLD) allows sharing per-task information between users and
> > + * bpf programs without explicit storage layout management. TLD APIs use
> > + * string keys to access data. Internally, TLD shares user pages through two
> > + * UPTRs in a task local storage: udata and umetadata. Data are stored in udata.
> > + * Keys and the offsets of the corresponding data in udata are stored in umetadata.
> > + *
> > + * Usage
> > + *
> > + * Users should initialize every task local data once for every new task before
> > + * using them with bpf_tld_init_var(). It should be done ideally in non-critical
> > + * paths first (e.g., sched_ext_ops::init_task) as it compares key strings and
> > + * caches the offsets of data.
> > + *
> > + * First, users should define struct task_local_data_offsets, which will be used
> > + * to cache the offsets of task local data. Each member of the struct should
> > + * be a short integer with the same name as the key defined in the user space.
> > + * Another task local storage map will be created to save the offsets. For example:
> > + *
> > + * struct task_local_data_offsets {
> > + * short priority;
> > + * short in_critical_section;
> > + * };
>
> The use of 'short' is unusual.
> The kernel and bpf progs always use either u16 or s16.
Will change the use of short to u16.
>
> > + *
> > + * Task local data APIs take a pointer to bpf_task_local_data object as the first
> > + * argument. The object should be declared as a stack variable and initialized via
> > + * bpf_tld_init(). Then, in a bpf program, to cache the offset for a key-value pair,
> > + * call bpf_tld_init_var(). For example, in init_task program:
> > + *
> > + * struct bpf_task_local_data tld;
> > + *
> > + * err = bpf_tld_init(task, &tld);
> > + * if (err)
> > + * return 0;
> > + *
> > + * bpf_tld_init_var(&tld, priority);
> > + * bpf_tld_init_var(&tld, in_critical_section);
> > + *
> > + * Subsequently and in other bpf programs, to lookup task local data, call
> > + * bpf_tld_lookup(). For example:
> > + *
> > + * int *p;
> > + *
> > + * p = bpf_tld_lookup(&tld, priority, sizeof(int));
> > + * if (p)
> > + * // do something depending on *p
> > + */
> > +
> > +struct task_local_data_offsets;
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > + __type(key, int);
> > + __type(value, struct task_local_data_map_value);
> > + __uint(pinning, LIBBPF_PIN_BY_NAME);
> > +} task_local_data_map SEC(".maps");
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > + __type(key, int);
> > + __type(value, struct task_local_data_offsets);
> > +} task_local_data_off_map SEC(".maps");
> > +
> > +struct bpf_task_local_data {
> > + struct task_local_data_map_value *data_map;
> > + struct task_local_data_offsets *off_map;
> > +};
> > +
> > +/**
> > + * @brief bpf_tld_init() initializes a bpf_task_local_data object.
> > + *
> > + * @param task The task_struct of the target task
> > + * @param tld A pointer to a bpf_task_local_data object to be initialized
> > + * @return -EINVAL if task KV store is not initialized by the user space for this task;
> > + * -ENOMEM if cached offset map creation fails. In both cases, users can abort, or
> > + * conitnue without calling task KV store APIs as if no key-value pairs are set.
>
> continue
>
> > + */
> > +__attribute__((unused))
> > +static int bpf_tld_init(struct task_struct *task, struct bpf_task_local_data *tld)
> > +{
> > + tld->data_map = bpf_task_storage_get(&task_local_data_map, task, 0, 0);
> > + if (!tld->data_map)
> > + return -EINVAL;
> > +
> > + tld->off_map = bpf_task_storage_get(&task_local_data_off_map, task, 0,
> > + BPF_LOCAL_STORAGE_GET_F_CREATE);
> > + if (!tld->off_map)
> > + return -ENOMEM;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * @brief bpf_tld_init_var() looks up the metadata with a key and caches the offset of
> > + * the value corresponding to the key.
> > + *
> > + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
> > + * @param key The key used to lookup the task KV store. Should be one of the
> > + * symbols defined in struct task_local_data_offsets, not a string
> > + */
> > +#define bpf_tld_init_var(tld, key) \
> > + ({ \
> > + (tld)->off_map->key = bpf_tld_fetch_off(tld, #key); \
> > + })
> > +
> > +__attribute__((unused))
> > +static short bpf_tld_fetch_off(struct bpf_task_local_data *tld, const char *key)
> > +{
> > + int i, umetadata_off, umetadata_cnt, udata_start;
> > + void *umetadata, *key_i, *off_i;
> > + short off = 0;
> > +
> > + if (!tld->data_map || !tld->data_map->umetadata)
> > + goto out;
> > +
> > + udata_start = tld->data_map->udata_start;
> > + umetadata_cnt = tld->data_map->umetadata_cnt;
> > + umetadata = tld->data_map->umetadata->meta;
> > +
> > + bpf_for(i, 0, umetadata_cnt) {
> > + umetadata_off = i * sizeof(struct key_meta);
> > + if (umetadata_off > PAGE_SIZE - sizeof(struct key_meta))
> > + break;
> > +
> > + key_i = umetadata + umetadata_off + offsetof(struct key_meta, key);
> > + off_i = umetadata + umetadata_off + offsetof(struct key_meta, off);
> > +
> > + if (!bpf_strncmp(key_i, TASK_LOCAL_DATA_KEY_LEN, key)) {
> > + off = *(short *)(off_i) + udata_start;
> > + if (off >= PAGE_SIZE)
> > + off = (off - PAGE_SIZE) | PAGE_IDX_MASK;
> > + /* Shift cached offset by 1 so that 0 means not initialized */
> > + off += 1;
> > + break;
> > + }
> > + }
> > +out:
> > + return off;
> > +}
> > +
> > +/**
> > + * @brief bpf_tld_lookup() looks up the task KV store using the cached offset
> > + * corresponding to the key.
> > + *
> > + * @param tld A pointer to a valid bpf_task_local_data object initialized by bpf_tld_init()
> > + * @param key The key used to lookup the task KV store. Should be one of the
> > + * symbols defined in struct task_local_data_offsets, not a string
> > + * @param size The size of the value. Must be a known constant value
> > + * @return A pointer to the value corresponding to the key; NULL if the offset
> > + * is not cached or the size is too big
> > + */
> > +#define bpf_tld_lookup(tld, key, size) __bpf_tld_lookup(tld, (tld)->off_map->key, size)
> > +__attribute__((unused))
> > +static void *__bpf_tld_lookup(struct bpf_task_local_data *tld, short cached_off, int size)
> > +{
> > + short page_off, page_idx;
> > +
> > + if (!cached_off--)
> > + return NULL;
> > +
> > + page_off = cached_off & ~PAGE_IDX_MASK;
> > + page_idx = !!(cached_off & PAGE_IDX_MASK);
> > +
> > + if (page_idx) {
> > + return (tld->data_map->udata[1].page && page_off < PAGE_SIZE - size) ?
> > + (void *)tld->data_map->udata[1].page + page_off : NULL;
> > + } else {
> > + return (tld->data_map->udata[0].page && page_off < PAGE_SIZE - size) ?
> > + (void *)tld->data_map->udata[0].page + page_off : NULL;
> > + }
> > +}
> > diff --git a/tools/testing/selftests/bpf/task_local_data_common.h b/tools/testing/selftests/bpf/task_local_data_common.h
> > new file mode 100644
> > index 000000000000..2a0bb724c77c
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/task_local_data_common.h
> > @@ -0,0 +1,41 @@
> > +#ifndef __BPF_TASK_KV_STORE_COMMON_H__
> > +#define __BPF_TASK_KV_STORE_COMMON_H__
> > +
> > +#ifdef __BPF__
> > +struct data_page *dummy_data_page;
> > +struct meta_page *dummy_meta_page;
> > +#else
> > +#define __uptr
> > +#endif
> > +
> > +
> > +#define TASK_LOCAL_DATA_MAP_PIN_PATH "/sys/fs/bpf/task_local_data_map"
> > +#define TASK_LOCAL_DATA_KEY_LEN 62
> > +#define PAGE_SIZE 4096
>
> We have
> enum page_size_enum {
> __PAGE_SIZE = PAGE_SIZE
> };
> inside kernel/bpf/core.c
>
> and bpf progs that include vmlinux.h can use it directly as __PAGE_SIZE.
>
Thanks for the tip. I will get page size from vmlinux.h
> Let's think through upfront how the whole thing works on
> architectures with different page size.
Other than the different number of data entries users can store, do you
see potential issues?
>
> > +#define PAGE_MASK (~(PAGE_SIZE - 1))
> > +
> > +struct data_page {
> > + char data[PAGE_SIZE];
> > +};
> > +
> > +struct data_page_entry {
> > + struct data_page __uptr *page;
> > +};
> > +
> > +struct key_meta {
> > + char key[TASK_LOCAL_DATA_KEY_LEN];
> > + short off;
> > +};
> > +
> > +struct meta_page {
> > + struct key_meta meta[64];
> > +};
> > +
> > +struct task_local_data_map_value {
> > + struct data_page_entry udata[2];
> > + struct meta_page __uptr *umetadata;
> > + short umetadata_cnt;
> > + short udata_start;
> > +};
> > +
> > +#endif
> > --
> > 2.47.1
> >
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-02 4:26 ` Amery Hung
@ 2025-05-02 16:14 ` Andrii Nakryiko
2025-05-02 18:36 ` Tejun Heo
0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2025-05-02 16:14 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, tj, martin.lau,
kernel-team
On Thu, May 1, 2025 at 9:26 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Thu, May 1, 2025 at 1:37 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Apr 25, 2025 at 2:40 PM Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > This is a respin of uptr KV store. It is renamed to task local data (TLD)
> > > as the problem statement and the solution have changed, and it now draws
> > > more similarities to pthread thread-specific data.
> > >
[...]
> >
> > This API can be called just once per each key that process cares
> > about. And this can be done at any point, really, very dynamically.
> > The implementation will:
> > - (just once per process) open pinned BPF map, remember its FD;
> > - (just once) allocate struct tld_metadata, unless we define it as
> > pre-allocated global variable;
> > - (locklessly) check if key_name is already in tld_metadata, if yes
> > - return already assigned offset;
> > - (locklessly) if not, add this key and assign it offset that is
> > offs[cnt - 1] + szs[cnt - 1] (i.e., we just tightly pack all the
> > values (though we should take care of alignment requirements, of
> > course);
> > - return newly assigned offset;
> >
> > Now, the second essential API is called for each participating thread
> > for each different key. And again, this is all very dynamic. It's
> > possible that some threads won't use any of this TLD stuff, in which
> > case there will be no overhead (memory or CPU), and not even an entry
> > in task local storage map for that thread. So, API:
> >
>
> The advantage of wasting no memory on threads that are not using TLD
> doesn't seem that clear-cut to me. If users add per-process
> hints, this scheme can potentially use a lot more memory (i.e.,
> PAGE_SIZE * number of threads). Maybe we need another uptr for
> per-process data? Or do you think this is out of the scope of TLD and
> we should recommend other solutions?
>
I'd keep it simple. One page per thread isn't a big deal at all, in my
mind. If the application has a few threads, then a bunch of kilobytes
is not a big deal. If the application has thousands of threads, then a
few megabytes for this is the least of that application's concern,
it's already heavy-weight as hell. I think we are overpivoting on
saving a few bytes here.
[...]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-02 2:22 ` Andrii Nakryiko
@ 2025-05-02 17:55 ` Alexei Starovoitov
0 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-05-02 17:55 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Amery Hung, bpf, Network Development, Andrii Nakryiko,
Daniel Borkmann, Tejun Heo, Martin KaFai Lau, Kernel Team
On Thu, May 1, 2025 at 7:22 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> I wasn't trying to optimize those few bytes taken by szs, tbh.
> Allocating from the end of the page bakes in the assumption that we
> won't ever need more than one page. I don't know if I'd do that. But
> we can just track "next available offset" instead, so it doesn't
> really matter much.
Right. That works too.
> >
> > I'm not quite sure how different processes can do it locklessly.
>
> There are no different processes, it's all one process, many
> threads... Or is that what you meant? tld_metadata is *per process*,
> tld_data is *per thread*. Processes don't need to coordinate anything
> between themselves, only threads within the process.
Yeah. I confused myself thinking that we need to support this
across fork/exec. Since those will be different processes,
they will have their own task local storage map elements,
so any kind of signaling into a child needs to be done in user space;
the bpf tls map won't work for that.
So using one "tld" library in multiple threads within a single
process works. No need to complicate things by going through the
kernel tls map. The "tld" library can keep whatever state it needs,
centralized locking, etc.
> As for how I'd do offset allocation and key addition locklessly. You
> are right that it can't be done completely locklessly, but just
> looping and yielding probably would be fine.
>
> Then the sequence of adding the key would be something like below.
> I've modified tld_metadata a bit to make this simpler and more
> economical (and I fixed definition of keys array of array of chars,
> oops):
>
> struct tld_metadata {
> int cnt;
> int next_off;
> char keys[MAX_KEY_CNT][MAX_KEY_LEN];
> __u16 offs[MAX_KEY_CNT];
> };
>
> struct tld_metadata *m = ...;
> const char *new_key = ...;
> int i = 0;
>
> /* all m->offs[i] are set to -1 on creation */
> again:
>
> int key_cnt = m->cnt;
> for (; i < key_cnt; i++) {
>         while (m->offs[i] < 0) /* update in progress */
>                 sched_yield();
>
>         if (strcmp(m->keys[i], new_key) == 0)
>                 return m->offs[i];
> }
>
> if (cmpxchg(&m->cnt, key_cnt, key_cnt + 1) != key_cnt)
>         goto again; /* we raced, key might have been added
>                      * already, recheck, but keep i */
>
> /* slot key_cnt is ours, we need to calculate and assign offset */
> int new_off = m->next_off;
> m->next_off = new_off + key_sz;
>
> m->keys[key_cnt][0] = '\0';
> strncat(m->keys[key_cnt], new_key, MAX_KEY_LEN);
>
> /* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */
>
> m->offs[key_cnt] = new_off; /* this finalizes the key -> offset
>                              * assignment */
>
> /* MEMORY BARRIERS SHOULD BE CAREFULLY CONSIDERED */
>
> return new_off; /* we are done */
>
> Something like that. There is that looping and yield to not miss
> someone else winning the race and adding a key, so that's the locking
> part. But given that adding a key definition is supposed to be one
> time operation (per key), I don't think we should be fancy with
> locking.
something like that should work.
I wish there were some trivial futex wrapper in a .h
that could be used instead of the pthread_mutex baggage.
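[Editor's sketch] A minimal futex-backed lock along the lines Alexei wishes for could look roughly like this: the classic three-state futex mutex in user space. Linux-only, and a sketch rather than a proposal for this series:

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* *f is 0 = unlocked, 1 = locked, 2 = locked with possible waiters */

static void futex_lock(atomic_int *f)
{
	int c = 0;

	/* fast path: uncontended acquire, no syscall */
	if (atomic_compare_exchange_strong(f, &c, 1))
		return;

	do {
		/* mark the lock contended, then sleep until woken */
		if (c == 2 || atomic_exchange(f, 2) != 0)
			syscall(SYS_futex, f, FUTEX_WAIT_PRIVATE, 2,
				NULL, NULL, 0);
		c = 0;
	} while (!atomic_compare_exchange_strong(f, &c, 2));
}

static void futex_unlock(atomic_int *f)
{
	/* release; wake one waiter only if someone could be sleeping */
	if (atomic_exchange(f, 0) == 2)
		syscall(SYS_futex, f, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}
```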
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-02 16:14 ` Andrii Nakryiko
@ 2025-05-02 18:36 ` Tejun Heo
2025-05-02 20:10 ` Andrii Nakryiko
0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2025-05-02 18:36 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Amery Hung, bpf, netdev, alexei.starovoitov, andrii, daniel,
martin.lau, kernel-team
Hello,
On Fri, May 02, 2025 at 09:14:47AM -0700, Andrii Nakryiko wrote:
> > The advantage of no memory wasted for threads that are not using TLD
> > doesn't seem to be that definite to me. If users add per-process
> > hints, then this scheme can potentially use a lot more memory (i.e.,
> > PAGE_SIZE * number of threads). Maybe we need another uptr for
> > per-process data? Or do you think this is out of the scope of TLD and
> > we should recommend other solutions?
>
> I'd keep it simple. One page per thread isn't a big deal at all, in my
> mind. If the application has a few threads, then a bunch of kilobytes
> is not a big deal. If the application has thousands of threads, then a
> few megabytes for this is the least of that application's concern,
> it's already heavy-weight as hell. I think we are overpivoting on
> saving a few bytes here.
It could well be that 4k is a price worth paying but there will be cases
where this matters. With 100k threads - not common but not unheard of
either, that's ~400MB. If the data needed to be shared is small and most of
that is wasted, that's not an insignificant amount. uptr supports sub-page
sizing, right? If keeping sizing dynamic is too complex, can't a process
just set the max size to what it deems appropriate?
Thanks.
--
tejun
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-02 18:36 ` Tejun Heo
@ 2025-05-02 20:10 ` Andrii Nakryiko
2025-05-02 21:23 ` Amery Hung
0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2025-05-02 20:10 UTC (permalink / raw)
To: Tejun Heo
Cc: Amery Hung, bpf, netdev, alexei.starovoitov, andrii, daniel,
martin.lau, kernel-team
On Fri, May 2, 2025 at 11:36 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 02, 2025 at 09:14:47AM -0700, Andrii Nakryiko wrote:
> > > The advantage of no memory wasted for threads that are not using TLD
> > > doesn't seem to be that definite to me. If users add per-process
> > > hints, then this scheme can potentially use a lot more memory (i.e.,
> > > PAGE_SIZE * number of threads). Maybe we need another uptr for
> > > per-process data? Or do you think this is out of the scope of TLD and
> > > we should recommend other solutions?
> >
> > I'd keep it simple. One page per thread isn't a big deal at all, in my
> > mind. If the application has a few threads, then a bunch of kilobytes
> > is not a big deal. If the application has thousands of threads, then a
> > few megabytes for this is the least of that application's concern,
> > it's already heavy-weight as hell. I think we are overpivoting on
> > saving a few bytes here.
>
> It could well be that 4k is a price worth paying but there will be cases
> where this matters. With 100k threads - not common but not unheard of
> either, that's ~400MB. If the data needed to be shared is small and most of
> that is wasted, that's not an insignificant amount. uptr supports sub-page
> sizing, right? If keeping sizing dynamic is too complex, can't a process
> just set the max size to what it deems appropriate?
>
One page was just a maximum supportable size due to uptr stuff. But it
can absolutely be (much) smaller than that, of course. The main
simplification from having a single fixed-sized data area allocation
is that an application can permanently cache an absolute pointer
returned from tld_resolve_key(). If we allow resizing the data area,
all previously returned pointers could be invalidated. So that's the
only thing. But yeah, if we know that we won't need more than, say 64
bytes, nothing prevents us from allocating just those 64 bytes (per
participating thread) instead of an entire page.
> Thanks.
>
> --
> tejun
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-02 20:10 ` Andrii Nakryiko
@ 2025-05-02 21:23 ` Amery Hung
2025-05-02 22:25 ` Andrii Nakryiko
0 siblings, 1 reply; 18+ messages in thread
From: Amery Hung @ 2025-05-02 21:23 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Tejun Heo, bpf, netdev, alexei.starovoitov, andrii, daniel,
martin.lau, kernel-team
On Fri, May 2, 2025 at 1:11 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, May 2, 2025 at 11:36 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Fri, May 02, 2025 at 09:14:47AM -0700, Andrii Nakryiko wrote:
> > > > The advantage of no memory wasted for threads that are not using TLD
> > > > doesn't seem to be that definite to me. If users add per-process
> > > > hints, then this scheme can potentially use a lot more memory (i.e.,
> > > > PAGE_SIZE * number of threads). Maybe we need another uptr for
> > > > per-process data? Or do you think this is out of the scope of TLD and
> > > > we should recommend other solutions?
> > >
> > > I'd keep it simple. One page per thread isn't a big deal at all, in my
> > > mind. If the application has a few threads, then a bunch of kilobytes
> > > is not a big deal. If the application has thousands of threads, then a
> > > few megabytes for this is the least of that application's concern,
> > > it's already heavy-weight as hell. I think we are overpivoting on
> > > saving a few bytes here.
> >
> > It could well be that 4k is a price worth paying but there will be cases
> > where this matters. With 100k threads - not common but not unheard of
> > either, that's ~400MB. If the data needed to be shared is small and most of
> > that is wasted, that's not an insignificant amount. uptr supports sub-page
> > sizing, right? If keeping sizing dynamic is too complex, can't a process
> > just set the max size to what it deems appropriate?
> >
>
> One page was just a maximum supportable size due to uptr stuff. But it
> can absolutely be (much) smaller than that, of course. The main
> simplification from having a single fixed-sized data area allocation
> is that an application can permanently cache an absolute pointer
> returned from tld_resolve_key(). If we allow resizing the data area,
> all previously returned pointers could be invalidated. So that's the
> only thing. But yeah, if we know that we won't need more than, say 64
> bytes, nothing prevents us from allocating just those 64 bytes (per
> participating thread) instead of an entire page.
>
Since users can add keys on the fly, I feel it is natural to also
allocate data area dynamically. Otherwise, there is going to be this
hard trade-off between data size limit and waste of memory.
We can tweak the implementation to make it allocate data dynamically.
The two user space APIs can remain almost the same, but users should
not cache the pointer returned from tld_resolve_ptr(). The only
difference is changing tld_off_t to the metadata index.
void *tld_resolve_ptr(tld_off_t idx) will allocate the data area lazily:
- Record the total tld data size in tld_metadata, data_sz.
- Use a __thread variable, th_data_sz, to keep track of the memory
allocated for the thread (can be a small number or 0 initially).
- If offs[idx] + szs[idx] > th_data_sz, resize the memory to cover that
sum (either exactly, or rounded up to the next power of 2 to prevent
frequent reallocation).
- If offs[idx] + szs[idx] <= th_data_sz, return tld->data + offs[idx]
(fast path).
The downside is data access overhead as pointers cannot be cached, but
I think it is an okay middle ground.
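[Editor's sketch] A user-space sketch of the lazy scheme above. The name tld_resolve_ptr, the metadata layout, and the power-of-2 growth policy are assumptions drawn from this thread, not a settled API, and the uptr re-registration that a real grow would require is omitted:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_KEY_CNT 64

/* per-process metadata, along the lines sketched earlier in the thread */
struct tld_metadata {
	int cnt;
	int next_off;
	unsigned short offs[MAX_KEY_CNT];
	unsigned short szs[MAX_KEY_CNT];
};

static struct tld_metadata meta;

/* per-thread data area, grown lazily on first use of a large offset */
static __thread char *th_data;
static __thread size_t th_data_sz;

/*
 * Resolve a metadata index into a pointer, growing this thread's data
 * area if needed.  Callers must not cache the result across resolves,
 * since a later grow can move the area.
 */
void *tld_resolve_ptr(int idx)
{
	size_t need = (size_t)meta.offs[idx] + meta.szs[idx];

	if (need > th_data_sz) {
		/* round up to the next power of 2 to avoid frequent reallocs */
		size_t new_sz = th_data_sz ? th_data_sz : 16;
		char *p;

		while (new_sz < need)
			new_sz *= 2;
		p = realloc(th_data, new_sz);
		if (!p)
			return NULL;
		memset(p + th_data_sz, 0, new_sz - th_data_sz);
		th_data = p;
		th_data_sz = new_sz;
	}
	return th_data + meta.offs[idx]; /* fast path once large enough */
}
```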
> > Thanks.
> >
> > --
> > tejun
* Re: [PATCH RFC v3 0/2] Task local data API
2025-05-02 21:23 ` Amery Hung
@ 2025-05-02 22:25 ` Andrii Nakryiko
0 siblings, 0 replies; 18+ messages in thread
From: Andrii Nakryiko @ 2025-05-02 22:25 UTC (permalink / raw)
To: Amery Hung
Cc: Tejun Heo, bpf, netdev, alexei.starovoitov, andrii, daniel,
martin.lau, kernel-team
On Fri, May 2, 2025 at 2:23 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Fri, May 2, 2025 at 1:11 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, May 2, 2025 at 11:36 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Fri, May 02, 2025 at 09:14:47AM -0700, Andrii Nakryiko wrote:
> > > > > The advantage of no memory wasted for threads that are not using TLD
> > > > > doesn't seem to be that definite to me. If users add per-process
> > > > > hints, then this scheme can potentially use a lot more memory (i.e.,
> > > > > PAGE_SIZE * number of threads). Maybe we need another uptr for
> > > > > per-process data? Or do you think this is out of the scope of TLD and
> > > > > we should recommend other solutions?
> > > >
> > > > I'd keep it simple. One page per thread isn't a big deal at all, in my
> > > > mind. If the application has a few threads, then a bunch of kilobytes
> > > > is not a big deal. If the application has thousands of threads, then a
> > > > few megabytes for this is the least of that application's concern,
> > > > it's already heavy-weight as hell. I think we are overpivoting on
> > > > saving a few bytes here.
> > >
> > > It could well be that 4k is a price worth paying but there will be cases
> > > where this matters. With 100k threads - not common but not unheard of
> > > either, that's ~400MB. If the data needed to be shared is small and most of
> > > that is wasted, that's not an insignificant amount. uptr supports sub-page
> > > sizing, right? If keeping sizing dynamic is too complex, can't a process
> > > just set the max size to what it deems appropriate?
> > >
> >
> > One page was just a maximum supportable size due to uptr stuff. But it
> > can absolutely be (much) smaller than that, of course. The main
> > simplification from having a single fixed-sized data area allocation
> > is that an application can permanently cache an absolute pointer
> > returned from tld_resolve_key(). If we allow resizing the data area,
> > all previously returned pointers could be invalidated. So that's the
> > only thing. But yeah, if we know that we won't need more than, say 64
> > bytes, nothing prevents us from allocating just those 64 bytes (per
> > participating thread) instead of an entire page.
> >
>
> Since users can add keys on the fly, I feel it is natural to also
> allocate data area dynamically. Otherwise, there is going to be this
> hard trade-off between data size limit and waste of memory.
>
> We can tweak the implementation to make it allocate data dynamically.
> The two user space APIs can remain almost the same, but users should
> not cache the pointer returned from tld_resolve_ptr(). The only
> difference is changing tld_off_t to the metadata index.
>
> void *tld_resolve_ptr(tld_off_t idx) will allocate the data area lazily:
> - Record the total tld data size in tld_metadata, data_sz.
> - Use a __thread variable, th_data_sz, to keep track of the memory
> allocated for the thread (can be a small number or 0 initially).
> - If offs[idx] + szs[idx] > th_data_sz, resize the memory to cover that
> sum (either exactly, or rounded up to the next power of 2 to prevent
> frequent reallocation).
> - If offs[idx] + szs[idx] <= th_data_sz, return tld->data + offs[idx]
> (fast path).
>
> The downside is data access overhead as pointers cannot be cached, but
> I think it is an okay middle ground.
If it's for something like hinting whether the lock is held or not,
I'd prioritize caching of this pointer and performance, over trying to
save a few bytes.
>
> > > Thanks.
> > >
> > > --
> > > tejun
end of thread, other threads:[~2025-05-02 22:25 UTC | newest]
Thread overview: 18+ messages
2025-04-25 21:40 [PATCH RFC v3 0/2] Task local data API Amery Hung
2025-04-25 21:40 ` [PATCH RFC v3 1/2] selftests/bpf: Introduce task local data Amery Hung
2025-04-30 1:44 ` Alexei Starovoitov
2025-05-02 15:00 ` Amery Hung
2025-05-01 20:38 ` Andrii Nakryiko
2025-04-25 21:40 ` [PATCH RFC v3 2/2] selftests/bpf: Test basic workflow of " Amery Hung
2025-04-25 22:12 ` Tejun Heo
2025-04-25 22:51 ` Amery Hung
2025-05-01 20:37 ` [PATCH RFC v3 0/2] Task local data API Andrii Nakryiko
2025-05-01 23:26 ` Alexei Starovoitov
2025-05-02 2:22 ` Andrii Nakryiko
2025-05-02 17:55 ` Alexei Starovoitov
2025-05-02 4:26 ` Amery Hung
2025-05-02 16:14 ` Andrii Nakryiko
2025-05-02 18:36 ` Tejun Heo
2025-05-02 20:10 ` Andrii Nakryiko
2025-05-02 21:23 ` Amery Hung
2025-05-02 22:25 ` Andrii Nakryiko