* [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT
@ 2024-02-13 14:58 Sebastian Andrzej Siewior
2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
2024-02-13 14:58 ` [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage " Sebastian Andrzej Siewior
0 siblings, 2 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-13 14:58 UTC (permalink / raw)
To: bpf, netdev
Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu,
Stanislav Fomichev, Thomas Gleixner, Yonghong Song
In [0] I introduced explicit locking for resources which are otherwise
implicitly locked by local_bh_disable(), and this protection goes away
if the lock in local_bh_disable() is removed on PREEMPT_RT.
There was a big complaint when it came to the locking of XDP related
resources during XDP-redirect because it was mostly per-packet and the
locking, even when not contended, was seen as a problem [1]. Another XDP
related problem was that I also touched every NIC driver using XDP, which
further complicated the "how to write a NIC driver" story.
To solve both problems I was thinking about an alternative and ended up
moving the data structure from per-CPU to per-task on PREEMPT_RT.
Instead of adding it to task_struct directly, I added a pointer there and
set up the struct on the stack. In my debug build on x86-64, struct
bpf_xdp_storage is 112 bytes and collects the per-CPU resources.
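The gist of it, roughly (PREEMPT_RT only; simplified from patch 1):

  struct bpf_xdp_storage {
          struct bpf_redirect_info ri;
          /* patch 2 additionally collects the flush lists here */
  };

  /* in struct task_struct */
  struct bpf_xdp_storage *bpf_xdp_storage;

Each user (NAPI thread, cpu_map kthread, the test runners) places the
struct on its stack and publishes it via the per-task pointer for the
duration of the callback.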
I'm posting here only two patches which replace the XDP redirect part
(patches 14 to 24) from the original series.
[0] https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/
[1] https://lore.kernel.org/all/CAADnVQKJBpvfyvmgM29FLv+KpLwBBRggXWzwKzaCT9U-4bgxjA@mail.gmail.com/
[2] https://lore.kernel.org/all/20231215145059.3b42ee35@kernel.org
Sebastian
* [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-13 14:58 [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior
@ 2024-02-13 14:58 ` Sebastian Andrzej Siewior
2024-02-13 20:50 ` Jesper Dangaard Brouer
2024-02-14 16:13 ` Toke Høiland-Jørgensen
2024-02-13 14:58 ` [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage " Sebastian Andrzej Siewior
1 sibling, 2 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-13 14:58 UTC (permalink / raw)
To: bpf, netdev
Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu,
Stanislav Fomichev, Thomas Gleixner, Yonghong Song,
Sebastian Andrzej Siewior
The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
packet and makes decisions. While doing that, the per-CPU variable
bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
and it may also access other per-CPU variables like xskmap_flush_list.
At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.
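Schematically, from within the NAPI callback (simplified; driver details
omitted):

  act = bpf_prog_run_xdp(prog, &xdp);          /* uses bpf_redirect_info */
  if (act == XDP_REDIRECT)
          xdp_do_redirect(rx_dev, &xdp, prog); /* bpf_redirect_info + flush lists */
  ...
  xdp_do_flush();                              /* walks the per-CPU flush lists */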
The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking to avoid corruption if preemption occurs.
PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided and data corruption is also avoided.
Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
Define the variable on the stack, use xdp_storage_set() to set a pointer
to it in the task_struct of the current task. Use the __free() annotation
to automatically reset the pointer once the function returns. Use a
pointer which can be consumed by the __free() annotation to avoid
invoking the callback if the pointer is NULL. This helps the compiler to
optimize the code.
The xdp_storage_set() can nest. For instance, local_bh_enable() in
bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/net_rx_action() which
also uses xdp_storage_set(). Therefore only the first invocation updates
the per-task pointer.
Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.
This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
per-CPU variable instead. This should also work for !PREEMPT_RT but
isn't needed.
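The resulting usage pattern, condensed from the hunks below, is:

  struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
  struct bpf_xdp_storage __xdp_store;

  xdp_store = xdp_storage_set(&__xdp_store);
  /* NULL if an outer caller already set the per-task pointer */
  ...
  ri = xdp_storage_get_ri(); /* &current->bpf_xdp_storage->ri on PREEMPT_RT */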
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/filter.h | 79 ++++++++++++++++++++++++++++++++++++---
include/linux/sched.h | 5 +++
kernel/bpf/cpumap.c | 2 +
kernel/fork.c | 3 ++
net/bpf/test_run.c | 9 ++++-
net/core/dev.c | 17 +++++++++
net/core/filter.c | 85 +++++++++++++++++++++++++++++++-----------
net/core/lwt_bpf.c | 3 ++
8 files changed, 174 insertions(+), 29 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 68fb6c8142fec..97c9be9cabfd6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -704,8 +704,69 @@ struct bpf_redirect_info {
struct bpf_nh_params nh;
};
+#ifndef CONFIG_PREEMPT_RT
DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_xdp_storage { };
+
+static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
+{
+ return NULL;
+}
+
+static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store) { }
+
+static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
+{
+ return this_cpu_ptr(&bpf_redirect_info);
+}
+
+#else
+
+struct bpf_xdp_storage {
+ struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
+{
+ struct task_struct *tsk;
+
+ tsk = current;
+ if (tsk->bpf_xdp_storage != NULL)
+ return NULL;
+ tsk->bpf_xdp_storage = xdp_store;
+ return xdp_store;
+}
+
+static inline void xdp_storage_clear(struct bpf_xdp_storage *xdp_store)
+{
+ struct task_struct *tsk;
+
+ tsk = current;
+ if (tsk->bpf_xdp_storage != xdp_store)
+ return;
+ tsk->bpf_xdp_storage = NULL;
+}
+
+static inline struct bpf_xdp_storage *xdp_storage_get(void)
+{
+ struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage;
+
+ WARN_ON_ONCE(!xdp_store);
+ return xdp_store;
+}
+
+static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
+{
+ struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+ if (!xdp_store)
+ return NULL;
+ return &xdp_store->ri;
+}
+#endif
+DEFINE_FREE(xdp_storage_clear, struct bpf_xdp_storage *, if (_T) xdp_storage_clear(_T));
+
/* flags for bpf_redirect_info kern_flags */
#define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */
@@ -974,23 +1035,27 @@ void bpf_clear_redirect_map(struct bpf_map *map);
static inline bool xdp_return_frame_no_direct(void)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
+ if (!ri)
+ return false;
return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
}
static inline void xdp_set_return_frame_no_direct(void)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
- ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+ if (ri)
+ ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
}
static inline void xdp_clear_return_frame_no_direct(void)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
- ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+ if (ri)
+ ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
}
static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
@@ -1544,9 +1609,11 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
u64 flags, const u64 flag_mask,
void *lookup_elem(struct bpf_map *map, u32 key))
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
+ if (!ri)
+ return XDP_ABORTED;
/* Lower bits of the flags are used as return code on lookup failure */
if (unlikely(flags & ~(action_mask | flag_mask)))
return XDP_ABORTED;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 111f388d65323..179ea10ae1fd1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
struct blk_plug;
struct bpf_local_storage;
struct bpf_run_ctx;
+struct bpf_xdp_storage;
struct capture_control;
struct cfs_rq;
struct fs_struct;
@@ -1501,6 +1502,10 @@ struct task_struct {
/* Used for BPF run context */
struct bpf_run_ctx *bpf_ctx;
#endif
+#ifdef CONFIG_PREEMPT_RT
+ /* Used by BPF for per-TASK xdp storage */
+ struct bpf_xdp_storage *bpf_xdp_storage;
+#endif
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
unsigned long lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c40ae831ab1a6 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -261,10 +261,12 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
static int cpu_map_kthread_run(void *data)
{
+ struct bpf_xdp_storage xdp_store __cleanup(xdp_storage_clear);
struct bpf_cpu_map_entry *rcpu = data;
complete(&rcpu->kthread_running);
set_current_state(TASK_INTERRUPTIBLE);
+ xdp_storage_set(&xdp_store);
/* When kthread gives stop order, then rcpu have been disconnected
* from map, thus no new packets can enter. Remaining in-flight
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43ff..0d8eb8d20963e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2462,6 +2462,9 @@ __latent_entropy struct task_struct *copy_process(
RCU_INIT_POINTER(p->bpf_storage, NULL);
p->bpf_ctx = NULL;
#endif
+#ifdef CONFIG_PREEMPT_RT
+ p->bpf_xdp_storage = NULL;
+#endif
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd9193740178..50902d5254115 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -281,7 +281,7 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
u32 repeat)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri;
int err = 0, act, ret, i, nframes = 0, batch_sz;
struct xdp_frame **frames = xdp->frames;
struct xdp_page_head *head;
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
batch_sz = min_t(u32, repeat, xdp->batch_size);
local_bh_disable();
+ ri = xdp_storage_get_ri();
xdp_set_return_frame_no_direct();
for (i = 0; i < batch_sz; i++) {
@@ -365,8 +366,10 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
u32 repeat, u32 batch_size, u32 *time)
{
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
struct xdp_test_data xdp = { .batch_size = batch_size };
struct bpf_test_timer t = { .mode = NO_MIGRATE };
+ struct bpf_xdp_storage __xdp_store;
int ret;
if (!repeat)
@@ -376,6 +379,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
if (ret)
return ret;
+ xdp_store = xdp_storage_set(&__xdp_store);
bpf_test_timer_enter(&t);
do {
xdp.frame_cnt = 0;
@@ -392,7 +396,9 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
u32 *retval, u32 *time, bool xdp)
{
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
struct bpf_prog_array_item item = {.prog = prog};
+ struct bpf_xdp_storage __xdp_store;
struct bpf_run_ctx *old_ctx;
struct bpf_cg_run_ctx run_ctx;
struct bpf_test_timer t = { NO_MIGRATE };
@@ -412,6 +418,7 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
if (!repeat)
repeat = 1;
+ xdp_store = xdp_storage_set(&__xdp_store);
bpf_test_timer_enter(&t);
old_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
do {
diff --git a/net/core/dev.c b/net/core/dev.c
index de362d5f26559..c3f7d2a6b6134 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
struct net_device *orig_dev, bool *another)
{
struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+ struct bpf_xdp_storage __xdp_store;
int sch_ret;
if (!entry)
return skb;
+
+ xdp_store = xdp_storage_set(&__xdp_store);
if (*pt_prev) {
*ret = deliver_skb(skb, *pt_prev, orig_dev);
*pt_prev = NULL;
@@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
{
struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+ struct bpf_xdp_storage __xdp_store;
int sch_ret;
if (!entry)
return skb;
+ xdp_store = xdp_storage_set(&__xdp_store);
+
/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
* already set by the caller.
*/
@@ -6240,10 +6248,13 @@ void napi_busy_loop(unsigned int napi_id,
void *loop_end_arg, bool prefer_busy_poll, u16 budget)
{
unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
int (*napi_poll)(struct napi_struct *napi, int budget);
+ struct bpf_xdp_storage __xdp_store;
void *have_poll_lock = NULL;
struct napi_struct *napi;
+ xdp_store = xdp_storage_set(&__xdp_store);
restart:
napi_poll = NULL;
@@ -6716,10 +6727,13 @@ static void skb_defer_free_flush(struct softnet_data *sd)
static int napi_threaded_poll(void *data)
{
+ struct bpf_xdp_storage xdp_store __cleanup(xdp_storage_clear);
struct napi_struct *napi = data;
struct softnet_data *sd;
void *have;
+ xdp_storage_set(&xdp_store);
+
while (!napi_thread_wait(napi)) {
for (;;) {
bool repoll = false;
@@ -6753,13 +6767,16 @@ static int napi_threaded_poll(void *data)
static __latent_entropy void net_rx_action(struct softirq_action *h)
{
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
struct softnet_data *sd = this_cpu_ptr(&softnet_data);
unsigned long time_limit = jiffies +
usecs_to_jiffies(READ_ONCE(netdev_budget_usecs));
int budget = READ_ONCE(netdev_budget);
+ struct bpf_xdp_storage __xdp_store;
LIST_HEAD(list);
LIST_HEAD(repoll);
+ xdp_store = xdp_storage_set(&__xdp_store);
start:
sd->in_net_rx_action = true;
local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index eb8d5a0a0ec8f..5721acb15d40f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2473,8 +2473,10 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
.arg3_type = ARG_ANYTHING,
};
+#ifndef CONFIG_PREEMPT_RT
DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
+#endif
static struct net_device *skb_get_peer_dev(struct net_device *dev)
{
@@ -2488,11 +2490,15 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)
int skb_do_redirect(struct sk_buff *skb)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net *net = dev_net(skb->dev);
+ struct bpf_redirect_info *ri;
struct net_device *dev;
- u32 flags = ri->flags;
+ u32 flags;
+ ri = xdp_storage_get_ri();
+ if (!ri)
+ goto out_drop;
+ flags = ri->flags;
dev = dev_get_by_index_rcu(net, ri->tgt_index);
ri->tgt_index = 0;
ri->flags = 0;
@@ -2521,9 +2527,9 @@ int skb_do_redirect(struct sk_buff *skb)
BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
- if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+ if (unlikely((flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)) || !ri))
return TC_ACT_SHOT;
ri->flags = flags;
@@ -2542,9 +2548,9 @@ static const struct bpf_func_proto bpf_redirect_proto = {
BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
- if (unlikely(flags))
+ if (unlikely(flags || !ri))
return TC_ACT_SHOT;
ri->flags = BPF_F_PEER;
@@ -2564,9 +2570,9 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
int, plen, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
- if (unlikely((plen && plen < sizeof(*params)) || flags))
+ if (unlikely((plen && plen < sizeof(*params)) || flags || !ri))
return TC_ACT_SHOT;
ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
@@ -4292,6 +4298,7 @@ void xdp_do_check_flushed(struct napi_struct *napi)
void bpf_clear_redirect_map(struct bpf_map *map)
{
+#ifndef CONFIG_PREEMPT_RT
struct bpf_redirect_info *ri;
int cpu;
@@ -4305,6 +4312,19 @@ void bpf_clear_redirect_map(struct bpf_map *map)
if (unlikely(READ_ONCE(ri->map) == map))
cmpxchg(&ri->map, map, NULL);
}
+#else
+ /* ri->map is assigned in __bpf_xdp_redirect_map() from within an eBPF
+ * program during the NAPI callback. It is used during
+ * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+ * redirect callback afterwards. ri->map is cleared after usage.
+ * The path has no explicit RCU read section but local_bh_disable()
+ * is also an RCU read section which makes the complete softirq callback
+ * RCU protected. This in turn makes ri->map RCU protected and it is
+ * sufficient to wait one grace period to ensure that no "ri->map == map"
+ * reference exists. dev_map_free() removes the map from the list and then
+ * invokes synchronize_rcu() after calling this function.
+ */
+#endif
}
DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4313,11 +4333,14 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
u32 xdp_master_redirect(struct xdp_buff *xdp)
{
struct net_device *master, *slave;
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri;
master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
if (slave && slave != xdp->rxq->dev) {
+ ri = xdp_storage_get_ri();
+ if (!ri)
+ return XDP_ABORTED;
/* The target device is different from the receiving device, so
* redirect it to the new device.
* Using XDP_REDIRECT gets the correct behaviour from XDP enabled
@@ -4419,10 +4442,13 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- enum bpf_map_type map_type = ri->map_type;
+ struct bpf_redirect_info *ri;
- if (map_type == BPF_MAP_TYPE_XSKMAP)
+ ri = xdp_storage_get_ri();
+ if (!ri)
+ return -EINVAL;
+
+ if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
@@ -4433,10 +4459,13 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- enum bpf_map_type map_type = ri->map_type;
+ struct bpf_redirect_info *ri;
- if (map_type == BPF_MAP_TYPE_XSKMAP)
+ ri = xdp_storage_get_ri();
+ if (!ri)
+ return -EINVAL;
+
+ if (ri->map_type == BPF_MAP_TYPE_XSKMAP)
return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
@@ -4450,10 +4479,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
void *fwd,
enum bpf_map_type map_type, u32 map_id)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
struct bpf_map *map;
int err;
+ if (!ri) {
+ err = -EINVAL;
+ goto err;
+ }
switch (map_type) {
case BPF_MAP_TYPE_DEVMAP:
fallthrough;
@@ -4495,12 +4528,20 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- enum bpf_map_type map_type = ri->map_type;
- void *fwd = ri->tgt_value;
- u32 map_id = ri->map_id;
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
+ enum bpf_map_type map_type;
+ void *fwd;
+ u32 map_id;
int err;
+ if (!ri) {
+ err = -EINVAL;
+ goto err;
+ }
+ map_type = ri->map_type;
+ fwd = ri->tgt_value;
+ map_id = ri->map_id;
+
ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
ri->map_type = BPF_MAP_TYPE_UNSPEC;
@@ -4529,9 +4570,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = xdp_storage_get_ri();
- if (unlikely(flags))
+ if (unlikely(flags || !ri))
return XDP_ABORTED;
/* NB! Map type UNSPEC and map_id == INT_MAX (never generated
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..54690b85f1fe6 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,15 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
struct dst_entry *dst, bool can_redirect)
{
+ struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
+ struct bpf_xdp_storage __xdp_store;
int ret;
/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
* BPF prog and skb_do_redirect().
*/
local_bh_disable();
+ xdp_store = xdp_storage_set(&__xdp_store);
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
--
2.43.0
* [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage on PREEMPT_RT.
2024-02-13 14:58 [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior
2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
@ 2024-02-13 14:58 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-13 14:58 UTC (permalink / raw)
To: bpf, netdev
Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu,
Stanislav Fomichev, Thomas Gleixner, Yonghong Song,
Sebastian Andrzej Siewior
The per-CPU flush lists are accessed from within the NAPI callback
(xdp_do_flush() for instance). They are subject to the same problem as struct
bpf_redirect_info.
Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_xdp_storage. Add wrappers for
accessing them. Do this only on PREEMPT_RT; keep the per-CPU lists on
non-PREEMPT_RT builds.
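The users of the lists stay as they are, only the lookup changes. For
instance __dev_flush(), condensed from the hunk below:

  struct list_head *flush_list = xdp_storage_get_dev_flush_list();
  struct xdp_dev_bulk_queue *bq, *tmp;

  list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
          bq_xmit_all(bq, XDP_XMIT_FLUSH);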
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/filter.h | 34 ++++++++++++++++++++++++++++++++++
kernel/bpf/cpumap.c | 33 ++++++++++++++++++++++++++-------
kernel/bpf/devmap.c | 33 ++++++++++++++++++++++++++-------
net/xdp/xsk.c | 33 +++++++++++++++++++++++++++------
4 files changed, 113 insertions(+), 20 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 97c9be9cabfd6..231ecdc431a00 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -725,6 +725,9 @@ static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
struct bpf_xdp_storage {
struct bpf_redirect_info ri;
+ struct list_head cpu_map_flush_list;
+ struct list_head dev_map_flush_list;
+ struct list_head xskmap_map_flush_list;
};
static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
@@ -734,6 +737,9 @@ static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xd
tsk = current;
if (tsk->bpf_xdp_storage != NULL)
return NULL;
+ INIT_LIST_HEAD(&xdp_store->cpu_map_flush_list);
+ INIT_LIST_HEAD(&xdp_store->dev_map_flush_list);
+ INIT_LIST_HEAD(&xdp_store->xskmap_map_flush_list);
tsk->bpf_xdp_storage = xdp_store;
return xdp_store;
}
@@ -764,6 +770,34 @@ static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
return NULL;
return &xdp_store->ri;
}
+
+static inline struct list_head *xdp_storage_get_cpu_map_flush_list(void)
+{
+ struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+ if (!xdp_store)
+ return NULL;
+ return &xdp_store->cpu_map_flush_list;
+}
+
+static inline struct list_head *xdp_storage_get_dev_flush_list(void)
+{
+ struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+ if (!xdp_store)
+ return NULL;
+ return &xdp_store->dev_map_flush_list;
+}
+
+static inline struct list_head *xdp_storage_get_xskmap_flush_list(void)
+{
+ struct bpf_xdp_storage *xdp_store = xdp_storage_get();
+
+ if (!xdp_store)
+ return NULL;
+ return &xdp_store->xskmap_map_flush_list;
+}
+
#endif
DEFINE_FREE(xdp_storage_clear, struct bpf_xdp_storage *, if (_T) xdp_storage_clear(_T));
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index c40ae831ab1a6..ec5be37399c82 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -78,8 +78,30 @@ struct bpf_cpu_map {
struct bpf_cpu_map_entry __rcu **cpu_map;
};
+#ifndef CONFIG_PREEMPT_RT
static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
+static struct list_head *xdp_storage_get_cpu_map_flush_list(void)
+{
+ return this_cpu_ptr(&cpu_map_flush_list);
+}
+
+static void init_cpu_map_flush_list(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
+}
+
+#else
+
+static void init_cpu_map_flush_list(void)
+{
+}
+
+#endif
+
static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
{
u32 value_size = attr->value_size;
@@ -703,7 +725,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
*/
static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
{
- struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+ struct list_head *flush_list = xdp_storage_get_cpu_map_flush_list();
struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -755,7 +777,7 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
void __cpu_map_flush(void)
{
- struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+ struct list_head *flush_list = xdp_storage_get_cpu_map_flush_list();
struct xdp_bulk_queue *bq, *tmp;
list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -769,7 +791,7 @@ void __cpu_map_flush(void)
#ifdef CONFIG_DEBUG_NET
bool cpu_map_check_flush(void)
{
- if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+ if (list_empty(xdp_storage_get_cpu_map_flush_list()))
return false;
__cpu_map_flush();
return true;
@@ -778,10 +800,7 @@ bool cpu_map_check_flush(void)
static int __init cpu_map_init(void)
{
- int cpu;
-
- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
+ init_cpu_map_flush_list();
return 0;
}
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a936c704d4e77..c1af5d9d60381 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,29 @@ struct bpf_dtab {
u32 n_buckets;
};
+#ifndef CONFIG_PREEMPT_RT
static DEFINE_PER_CPU(struct list_head, dev_flush_list);
+
+static struct list_head *xdp_storage_get_dev_flush_list(void)
+{
+ return this_cpu_ptr(&dev_flush_list);
+}
+
+static void init_dev_flush_list(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
+}
+
+#else
+
+static void init_dev_flush_list(void)
+{
+}
+#endif
+
static DEFINE_SPINLOCK(dev_map_lock);
static LIST_HEAD(dev_map_list);
@@ -407,7 +429,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
*/
void __dev_flush(void)
{
- struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+ struct list_head *flush_list = xdp_storage_get_dev_flush_list();
struct xdp_dev_bulk_queue *bq, *tmp;
list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -421,7 +443,7 @@ void __dev_flush(void)
#ifdef CONFIG_DEBUG_NET
bool dev_check_flush(void)
{
- if (list_empty(this_cpu_ptr(&dev_flush_list)))
+ if (list_empty(xdp_storage_get_dev_flush_list()))
return false;
__dev_flush();
return true;
@@ -452,7 +474,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
struct net_device *dev_rx, struct bpf_prog *xdp_prog)
{
- struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+ struct list_head *flush_list = xdp_storage_get_dev_flush_list();
struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1155,15 +1177,12 @@ static struct notifier_block dev_map_notifier = {
static int __init dev_map_init(void)
{
- int cpu;
-
/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
offsetof(struct _bpf_dtab_netdev, dev));
register_netdevice_notifier(&dev_map_notifier);
- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
+ init_dev_flush_list();
return 0;
}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b78c0e095e221..3050739cfe1e0 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,30 @@
#define TX_BATCH_SIZE 32
#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
+#ifndef CONFIG_PREEMPT_RT
static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
+static struct list_head *xdp_storage_get_xskmap_flush_list(void)
+{
+ return this_cpu_ptr(&xskmap_flush_list);
+}
+
+static void init_xskmap_flush_list(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
+}
+
+#else
+
+static void init_xskmap_flush_list(void)
+{
+}
+
+#endif
+
void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
{
if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -372,7 +394,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
{
- struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+ struct list_head *flush_list = xdp_storage_get_xskmap_flush_list();
int err;
err = xsk_rcv(xs, xdp);
@@ -387,7 +409,7 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
void __xsk_map_flush(void)
{
- struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+ struct list_head *flush_list = xdp_storage_get_xskmap_flush_list();
struct xdp_sock *xs, *tmp;
list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
@@ -399,7 +421,7 @@ void __xsk_map_flush(void)
#ifdef CONFIG_DEBUG_NET
bool xsk_map_check_flush(void)
{
- if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+ if (list_empty(xdp_storage_get_xskmap_flush_list()))
return false;
__xsk_map_flush();
return true;
@@ -1770,7 +1792,7 @@ static struct pernet_operations xsk_net_ops = {
static int __init xsk_init(void)
{
- int err, cpu;
+ int err;
err = proto_register(&xsk_proto, 0 /* no slab */);
if (err)
@@ -1788,8 +1810,7 @@ static int __init xsk_init(void)
if (err)
goto out_pernet;
- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
+ init_xskmap_flush_list();
return 0;
out_pernet:
--
2.43.0
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
@ 2024-02-13 20:50 ` Jesper Dangaard Brouer
2024-02-14 12:19 ` Sebastian Andrzej Siewior
2024-02-14 16:13 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-13 20:50 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, bpf, netdev
Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
Jakub Kicinski, Jiri Olsa, John Fastabend, Jonathan Lemon,
KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau,
Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev,
Thomas Gleixner, Yonghong Song
On 13/02/2024 15.58, Sebastian Andrzej Siewior wrote:
> The XDP redirect process is two-staged:
> - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
> packet and makes decisions. While doing that, the per-CPU variable
> bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
> and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protection reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking to avoid corruption if preemption occurs.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided and data corruption is also avoided.
>
> Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
> Define the variable on the stack, use xdp_storage_set() to set a pointer
> to it in the task_struct of the current task. Use the __free() annotation
> to automatically reset the pointer once the function returns. Use a pointer
> which can be consumed by the __free() annotation to avoid invoking the
> callback if the pointer is NULL. This helps the compiler to optimize the code.
> The xdp_storage_set() can nest. For instance local_bh_enable() in
> bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/ net_rx_action() which
> also uses xdp_storage_set(). Therefore only the first invocation
> updates the per-task pointer.
> Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
> per-CPU variable instead. This should also work for !PREEMPT_RT but
> isn't needed.
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
I generally like the idea around bpf_xdp_storage.
I only skimmed the code, but noticed some extra if-statements (for
!NULL). I don't think they will make a difference, but I know Toke wants
me to test it...
I'll hopefully have time to look at code closer tomorrow.
--Jesper
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-13 20:50 ` Jesper Dangaard Brouer
@ 2024-02-14 12:19 ` Sebastian Andrzej Siewior
2024-02-14 13:23 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-14 12:19 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, netdev, Björn Töpel, David S. Miller,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend,
Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu,
Stanislav Fomichev, Thomas Gleixner, Yonghong Song
On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
> I generally like the idea around bpf_xdp_storage.
>
> I only skimmed the code, but noticed some extra if-statements (for
> !NULL). I don't think they will make a difference, but I know Toke wants
> me to test it...
I've been looking at the assembly for the return value of
bpf_redirect_info() and there is a NULL pointer check. I hoped it was
obviously non-NULL because it is a static struct.
Should this become a problem I could add
"__attribute__((returns_nonnull))" to the declaration of the function
which will optimize the NULL check away.
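That is (untested):

| __attribute__((returns_nonnull))
| static inline struct bpf_redirect_info *xdp_storage_get_ri(void);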
> I'll hopefully have time to look at code closer tomorrow.
>
> --Jesper
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-14 12:19 ` Sebastian Andrzej Siewior
@ 2024-02-14 13:23 ` Toke Høiland-Jørgensen
2024-02-14 14:28 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-14 13:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Jesper Dangaard Brouer
Cc: bpf, netdev, Björn Töpel, David S. Miller,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa, John Fastabend,
Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu,
Stanislav Fomichev, Thomas Gleixner, Yonghong Song
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
>> I generally like the idea around bpf_xdp_storage.
>>
>> I only skimmed the code, but noticed some extra if-statements (for
>> !NULL). I don't think they will make a difference, but I know Toke wants
>> me to test it...
>
> I've been looking at the assembly for the return value of
> bpf_redirect_info() and there is a NULL pointer check. I hoped it was
> obviously non-NULL because it is a static struct.
>
> Should this become a problem I could add
> "__attribute__((returns_nonnull))" to the declaration of the function
> which will optimize the NULL check away.
If we know the function will never return NULL (I was wondering about
that, actually), why have the check in the C code at all? Couldn't we just
omit it entirely instead of relying on the compiler to optimise it out?
-Toke
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-14 13:23 ` Toke Høiland-Jørgensen
@ 2024-02-14 14:28 ` Sebastian Andrzej Siewior
2024-02-14 16:08 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-14 14:28 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel,
David S. Miller, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa,
John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song
On 2024-02-14 14:23:10 [+0100], Toke Høiland-Jørgensen wrote:
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>
> > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
> >> I generally like the idea around bpf_xdp_storage.
> >>
> >> I only skimmed the code, but noticed some extra if-statements (for
> >> !NULL). I don't think they will make a difference, but I know Toke wants
> >> me to test it...
> >
> > I've been looking at the assembly for the return value of
> > bpf_redirect_info() and there is a NULL pointer check. I hoped it was
> > obviously non-NULL because it is a static struct.
> >
> > Should this become a problem I could add
> > "__attribute__((returns_nonnull))" to the declaration of the function
> > which will optimize the NULL check away.
>
> If we know the function will never return NULL (I was wondering about
> that, actually), why have the check in the C code at all? Couldn't we just
> omit it entirely instead of relying on the compiler to optimise it out?
The !RT version does:
| static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
| {
| return this_cpu_ptr(&bpf_redirect_info);
| }
which is static and can't be NULL (unless, by some mysterious means, the
per-CPU offset + bpf_redirect_info offset is NULL). Maybe I can put the
attribute on this_cpu_ptr()… Let me think about it.
For RT I have:
| static inline struct bpf_xdp_storage *xdp_storage_get(void)
| {
| struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage;
|
| WARN_ON_ONCE(!xdp_store);
| return xdp_store;
| }
|
| static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
| {
| struct bpf_xdp_storage *xdp_store = xdp_storage_get();
|
| if (!xdp_store)
| return NULL;
| return &xdp_store->ri;
| }
so if current->bpf_xdp_storage is NULL then we get a warning and a NULL
pointer. This *should* not happen due to xdp_storage_set() which
assigns the pointer. However, if I missed a spot then there is the check
which aborts further processing.
During testing I forgot a spot in egress and in the test module. You could
argue that the warning alone is enough since a missed spot is deterministic
rather than random, so it should pop up in testing and not in production
(go boom, send a report). I *think* I covered all spots; at least the test
suite didn't point anything out to me.
I was unsure if I need something around net_tx_action() due to
TC_ACT_REDIRECT (I think qdisc) but this seems to be handled by
sch_handle_egress().
> -Toke
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-14 14:28 ` Sebastian Andrzej Siewior
@ 2024-02-14 16:08 ` Toke Høiland-Jørgensen
2024-02-14 16:36 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-14 16:08 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel,
David S. Miller, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa,
John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> On 2024-02-14 14:23:10 [+0100], Toke Høiland-Jørgensen wrote:
>> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>>
>> > On 2024-02-13 21:50:51 [+0100], Jesper Dangaard Brouer wrote:
>> >> I generally like the idea around bpf_xdp_storage.
>> >>
>> >> I only skimmed the code, but noticed some extra if-statements (for
>> >> !NULL). I don't think they will make a difference, but I know Toke wants
>> >> me to test it...
>> >
>> > I've been looking at the assembly for the return value of
>> > bpf_redirect_info() and there is a NULL pointer check. I hoped it was
>> > obviously non-NULL because it is a static struct.
>> >
>> > Should this become a problem I could add
>> > "__attribute__((returns_nonnull))" to the declaration of the function
>> > which will optimize the NULL check away.
>>
>> If we know the function will never return NULL (I was wondering about
>> that, actually), why have the check in the C code at all? Couldn't we just
>> omit it entirely instead of relying on the compiler to optimise it out?
>
> The !RT version does:
> | static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
> | {
> | return this_cpu_ptr(&bpf_redirect_info);
> | }
>
> which is static and can't be NULL (unless by mysterious ways the per-CPU
> offset + bpf_redirect_info offset is NULL). Maybe I can put this in
> this_cpu_ptr()… Let me think about it.
>
> For RT I have:
> | static inline struct bpf_xdp_storage *xdp_storage_get(void)
> | {
> | struct bpf_xdp_storage *xdp_store = current->bpf_xdp_storage;
> |
> | WARN_ON_ONCE(!xdp_store);
> | return xdp_store;
> | }
> |
> | static inline struct bpf_redirect_info *xdp_storage_get_ri(void)
> | {
> | struct bpf_xdp_storage *xdp_store = xdp_storage_get();
> |
> | if (!xdp_store)
> | return NULL;
> | return &xdp_store->ri;
> | }
>
> so if current->bpf_xdp_storage is NULL then we get a warning and a NULL
> pointer. This *should* not happen due to xdp_storage_set() which
> assigns the pointer. However if I missed a spot then there is the check
> which aborts further processing.
>
> During testing I forgot a spot in egress and in the test module. You could
> argue that the warning alone is enough since a missed spot is deterministic
> rather than random, so it should pop up in testing and not in production
> (go boom, send a report). I *think* I covered all spots; at least the test
> suite didn't point anything out to me.
Well, I would prefer if we could make sure we covered everything and not
have this odd failure mode where redirect just mysteriously stops
working. At the very least, if we keep the check we should have a
WARN_ON in there to make it really obvious that something needs to be
fixed.
This brings me to another thing I was going to point out separately, but
may as well mention it here: It would be good if we could keep the
difference between the RT and !RT versions as small as possible to avoid
having subtle bugs that only appear in one configuration.
I agree with Jesper that the concept of a stack-allocated "run context"
for the XDP program makes sense in general (and I have some vague ideas
about other things that may be useful to stick in there). So I'm
wondering if it makes sense to do that even in the !RT case? We can't
stick the pointer to it into 'current' when running in softirq, but we
could change the per-cpu variable to just be a pointer that gets
populated by xdp_storage_set()?
I'm not really sure if this would be performance neutral (it's just
moving around a few bits of memory, but we do gain an extra pointer
deref), but it should be simple enough to benchmark.
> I was unsure if I need something around net_tx_action() due to
> TC_ACT_REDIRECT (I think qdisc) but this seems to be handled by
> sch_handle_egress().
Yup, I believe you're correct.
-Toke
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
2024-02-13 20:50 ` Jesper Dangaard Brouer
@ 2024-02-14 16:13 ` Toke Høiland-Jørgensen
2024-02-15 9:04 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-14 16:13 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, bpf, netdev
Cc: Björn Töpel, David S. Miller, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Eric Dumazet, Hao Luo,
Jakub Kicinski, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
Jonathan Lemon, KP Singh, Maciej Fijalkowski, Magnus Karlsson,
Martin KaFai Lau, Paolo Abeni, Peter Zijlstra, Song Liu,
Stanislav Fomichev, Thomas Gleixner, Yonghong Song,
Sebastian Andrzej Siewior
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> The XDP redirect process is two-staged:
> - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
> packet and makes decisions. While doing that, the per-CPU variable
> bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
> and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protection reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking to avoid corruption if preemption occurs.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided and data corruption is also avoided.
>
> Create a struct bpf_xdp_storage which contains struct bpf_redirect_info.
> Define the variable on the stack, use xdp_storage_set() to set a pointer
> to it in the task_struct of the current task. Use the __free() annotation
> to automatically reset the pointer once the function returns. Use a pointer
> which can be consumed by the __free() annotation to avoid invoking the
> callback if the pointer is NULL. This helps the compiler to optimize the code.
> The xdp_storage_set() can nest. For instance local_bh_enable() in
> bpf_test_run_xdp_live() may run NET_RX_SOFTIRQ/ net_rx_action() which
> also uses xdp_storage_set(). Therefore only the first invocation
> updates the per-task pointer.
> Use xdp_storage_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> This is only done on PREEMPT_RT. The !PREEMPT_RT builds keep using the
> per-CPU variable instead. This should also work for !PREEMPT_RT but
> isn't needed.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index de362d5f26559..c3f7d2a6b6134 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3988,11 +3988,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> struct net_device *orig_dev, bool *another)
> {
> struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
> enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
> + struct bpf_xdp_storage __xdp_store;
> int sch_ret;
>
> if (!entry)
> return skb;
> +
> + xdp_store = xdp_storage_set(&__xdp_store);
> if (*pt_prev) {
> *ret = deliver_skb(skb, *pt_prev, orig_dev);
> *pt_prev = NULL;
> @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
> sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> {
> struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
> enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
> + struct bpf_xdp_storage __xdp_store;
> int sch_ret;
>
> if (!entry)
> return skb;
>
> + xdp_store = xdp_storage_set(&__xdp_store);
> +
> /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> * already set by the caller.
> */
These, and the LWT code, don't actually have anything to do with XDP,
which indicates that the 'xdp_storage' name misleading. Maybe
'bpf_net_context' or something along those lines? Or maybe we could just
move the flush lists into bpf_redirect_info itself and just keep that as
the top-level name?
-Toke
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-14 16:08 ` Toke Høiland-Jørgensen
@ 2024-02-14 16:36 ` Sebastian Andrzej Siewior
2024-02-15 20:23 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-14 16:36 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel,
David S. Miller, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa,
John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song
On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote:
> > During testing I forgot a spot in egress and in the test module. You could
> > argue that the warning alone is enough since a missed spot is deterministic
> > rather than random, so it should pop up in testing and not in production
> > (go boom, send a report). I *think* I covered all spots; at least the test
> > suite didn't point anything out to me.
>
> Well, I would prefer if we could make sure we covered everything and not
> have this odd failure mode where redirect just mysteriously stops
> working. At the very least, if we keep the check we should have a
> WARN_ON in there to make it really obvious that something needs to be
> fixed.
Agree.
> This brings me to another thing I was going to point out separately, but
> may as well mention it here: It would be good if we could keep the
> difference between the RT and !RT versions as small as possible to avoid
> having subtle bugs that only appear in one configuration.
Yes. I do so, too.
> I agree with Jesper that the concept of a stack-allocated "run context"
> for the XDP program makes sense in general (and I have some vague ideas
> about other things that may be useful to stick in there). So I'm
> wondering if it makes sense to do that even in the !RT case? We can't
> stick the pointer to it into 'current' when running in softirq, but we
> could change the per-cpu variable to just be a pointer that gets
> populated by xdp_storage_set()?
I *think* it could be added to current. The assignment currently
allows nesting so that is not a problem; only the outermost set/clear
does something. If you run in softirq, you would hijack a random
task_struct. If the pointer is already assigned then the lists and so on
must be empty because access is only allowed in BH-disabled sections.
However, using per-CPU for the pointer (instead of task_struct) would
have the advantage that it is always CPU/node local memory while the
random task_struct could have been allocated on a different NUMA node.
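The per-CPU pointer variant would look something like this (untested,
the variable name is made up):

| /* hypothetical per-CPU pointer instead of the per-task pointer */
| static DEFINE_PER_CPU(struct bpf_xdp_storage *, bpf_xdp_storage_ptr);
|
| static inline struct bpf_xdp_storage *xdp_storage_set(struct bpf_xdp_storage *xdp_store)
| {
| 	if (this_cpu_read(bpf_xdp_storage_ptr))
| 		return NULL;
| 	this_cpu_write(bpf_xdp_storage_ptr, xdp_store);
| 	return xdp_store;
| }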
> I'm not really sure if this would be performance neutral (it's just
> moving around a few bits of memory, but we do gain an extra pointer
> deref), but it should be simple enough to benchmark.
My guess is that we remain with one per-CPU dereference and an
additional "add offset". That is why I kept the !RT bits as they are,
to avoid getting yelled at.
I could prepare something and run a specific test if you have one.
> -Toke
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-14 16:13 ` Toke Høiland-Jørgensen
@ 2024-02-15 9:04 ` Sebastian Andrzej Siewior
2024-02-15 12:11 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-15 9:04 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: bpf, netdev, Björn Töpel, David S. Miller,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer,
Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh,
Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau,
Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev,
Thomas Gleixner, Yonghong Song
On 2024-02-14 17:13:03 [+0100], Toke Høiland-Jørgensen wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index de362d5f26559..c3f7d2a6b6134 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
> > sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> > {
> > struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
> > + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
> > enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
> > + struct bpf_xdp_storage __xdp_store;
> > int sch_ret;
> >
> > if (!entry)
> > return skb;
> >
> > + xdp_store = xdp_storage_set(&__xdp_store);
> > +
> > /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> > * already set by the caller.
> > */
>
>
> These, and the LWT code, don't actually have anything to do with XDP,
> which indicates that the 'xdp_storage' name misleading. Maybe
> 'bpf_net_context' or something along those lines? Or maybe we could just
> move the flush lists into bpf_redirect_info itself and just keep that as
> the top-level name?
I'm going to rename it as suggested for now. If it turns out to be a
better fit to include the lists in bpf_redirect_info then I will update
it accordingly.
> -Toke
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-15 9:04 ` Sebastian Andrzej Siewior
@ 2024-02-15 12:11 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-15 12:11 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: bpf, netdev, Björn Töpel, David S. Miller,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eric Dumazet, Hao Luo, Jakub Kicinski, Jesper Dangaard Brouer,
Jiri Olsa, John Fastabend, Jonathan Lemon, KP Singh,
Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau,
Paolo Abeni, Peter Zijlstra, Song Liu, Stanislav Fomichev,
Thomas Gleixner, Yonghong Song
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> On 2024-02-14 17:13:03 [+0100], Toke Høiland-Jørgensen wrote:
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index de362d5f26559..c3f7d2a6b6134 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4044,12 +4048,16 @@ static __always_inline struct sk_buff *
>> > sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>> > {
>> > struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
>> > + struct bpf_xdp_storage *xdp_store __free(xdp_storage_clear) = NULL;
>> > enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
>> > + struct bpf_xdp_storage __xdp_store;
>> > int sch_ret;
>> >
>> > if (!entry)
>> > return skb;
>> >
>> > + xdp_store = xdp_storage_set(&__xdp_store);
>> > +
>> > /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>> > * already set by the caller.
>> > */
>>
>>
>> These, and the LWT code, don't actually have anything to do with XDP,
>> which indicates that the 'xdp_storage' name is misleading. Maybe
>> 'bpf_net_context' or something along those lines? Or maybe we could just
>> move the flush lists into bpf_redirect_info itself and just keep that as
>> the top-level name?
>
> I'm going to rename it as suggested for now. If it is a better fit to
> include the lists in bpf_redirect_info then I will update accordingly.
OK, SGTM.
-Toke
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-14 16:36 ` Sebastian Andrzej Siewior
@ 2024-02-15 20:23 ` Toke Høiland-Jørgensen
2024-02-16 16:57 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-15 20:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel,
David S. Miller, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa,
John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote:
>> > During testing I forgot a spot in egress and the test module. You could
>> > argue that the warning is enough since it should pop up in testing and
>> > not production, because a missed spot is always hit, not just by chance (go
>> > boom, send a report). I *think* I covered all spots, at least the test
>> > suite didn't point anything out to me.
>>
>> Well, I would prefer if we could make sure we covered everything and not
>> have this odd failure mode where redirect just mysteriously stops
>> working. At the very least, if we keep the check we should have a
>> WARN_ON in there to make it really obvious that something needs to be
>> fixed.
>
> Agree.
>
>> This brings me to another thing I was going to point out separately, but
>> may as well mention it here: It would be good if we could keep the
>> difference between the RT and !RT versions as small as possible to avoid
>> having subtle bugs that only appear in one configuration.
>
> Yes. I do so, too.
>
>> I agree with Jesper that the concept of a stack-allocated "run context"
>> for the XDP program makes sense in general (and I have some vague ideas
>> about other things that may be useful to stick in there). So I'm
>> wondering if it makes sense to do that even in the !RT case? We can't
>> stick the pointer to it into 'current' when running in softirq, but we
>> could change the per-cpu variable to just be a pointer that gets
>> populated by xdp_storage_set()?
>
> I *think* that it could be added to current. The assignment currently
> allows nesting so that is not a problem. Only the outermost set/clear
> would do something. If you run in softirq, you would hijack a random
> task_struct. If the pointer is already assigned then the list and so on
> must be empty because access is only allowed in BH-disabled sections.
>
> However, using per-CPU for the pointer (instead of task_struct) would
> have the advantage that it is always CPU/node local memory while the
> random task_struct could have been allocated on a different NUMA node.
Ah yes, good point, that's probably desirable :)
>> I'm not really sure if this would be performance neutral (it's just
>> moving around a few bits of memory, but we do gain an extra pointer
>> deref), but it should be simple enough to benchmark.
>
> My guess is that we would remain with one per-CPU dereference and an
> additional "add offset". That is why I kept the !RT bits as they are, to
> avoid getting yelled at.
>
> I could prepare something and run a specific test if you have one.
The test itself is simple enough: Simply run xdp-bench (from
xdp-tools[0]) in drop mode, serve some traffic to the machine and
observe the difference in PPS before and after the patch.
The tricky part is that the traffic actually has to stress the CPU,
which means that the offered load has to be higher than what the CPU can
handle. Which generally means running on high-speed NICs with small
packets: a modern server CPU has no problem keeping up with a 10G link
even at 64-byte packet size, so a 100G NIC is needed, or the test needs
to be run on a low-powered machine.
As a traffic generator, the xdp-trafficgen utility also in xdp-tools can
be used, or the in-kernel pktgen, or something like T-rex or Moongen.
Generally serving UDP traffic with 64-byte packets on a single port
is enough to make sure the traffic is serviced by a single CPU, although
some configuration may be needed to steer IRQs as well.
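Something along these lines, with eth1 standing in for whatever interface
receives the test traffic (reducing to a single combined queue is one way
to keep all the load on one CPU):

# ethtool -L eth1 combined 1
# xdp-bench drop eth1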
-Toke
[0] https://github.com/xdp-project/xdp-tools
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-15 20:23 ` Toke Høiland-Jørgensen
@ 2024-02-16 16:57 ` Sebastian Andrzej Siewior
2024-02-19 19:01 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-16 16:57 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel,
David S. Miller, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa,
John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song
On 2024-02-15 21:23:23 [+0100], Toke Høiland-Jørgensen wrote:
> The tricky part is that the traffic actually has to stress the CPU,
> which means that the offered load has to be higher than what the CPU can
> handle. Which generally means running on high-speed NICs with small
> packets: a modern server CPU has no problem keeping up with a 10G link
> even at 64-byte packet size, so a 100G NIC is needed, or the test needs
> to be run on a low-powered machine.
I have a 10G box. I can tell cpufreq to go down to 1.1 GHz and I could
reduce the queues to one and hope that it is slow enough.
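Roughly this (a sketch; the device name is from my box and the exact
cpupower invocation may differ per distro):

# cpupower frequency-set -d 1.1GHz -u 1.1GHz
# ethtool -L eno2np1 combined 1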
> As a traffic generator, the xdp-trafficgen utility also in xdp-tools can
> be used, or the in-kernel pktgen, or something like T-rex or Moongen.
> Generally serving UDP traffic with 64-byte packets on a single port
> is enough to make sure the traffic is serviced by a single CPU, although
> some configuration may be needed to steer IRQs as well.
I played with xdp-trafficgen:
| # xdp-trafficgen udp eno2 -v
| Current rlimit is infinity or 0. Not raising
| Kernel supports 5-arg xdp_cpumap_kthread tracepoint
| Error in ethtool ioctl: Operation not supported
| Got -95 queues for ifname lo
| Kernel supports 5-arg xdp_cpumap_kthread tracepoint
| Got 94 queues for ifname eno2
| Transmitting on eno2 (ifindex 3)
| lo->eno2 0 err/s 0 xmit/s
| lo->eno2 0 err/s 0 xmit/s
| lo->eno2 0 err/s 0 xmit/s
I even tried setting the MAC address with -M/-m but nothing happened. I
do see something happening on the drop side when I issue a ping command.
Does that ring a bell? Otherwise I'll try pktgen. This is a Debian
kernel (just to ensure I didn't break something or forget a config
switch).
> -Toke
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-16 16:57 ` Sebastian Andrzej Siewior
@ 2024-02-19 19:01 ` Toke Høiland-Jørgensen
2024-02-20 9:17 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 25+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-19 19:01 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Jesper Dangaard Brouer, bpf, netdev, Björn Töpel,
David S. Miller, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Eric Dumazet, Hao Luo, Jakub Kicinski, Jiri Olsa,
John Fastabend, Jonathan Lemon, KP Singh, Maciej Fijalkowski,
Magnus Karlsson, Martin KaFai Lau, Paolo Abeni, Peter Zijlstra,
Song Liu, Stanislav Fomichev, Thomas Gleixner, Yonghong Song
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> On 2024-02-15 21:23:23 [+0100], Toke Høiland-Jørgensen wrote:
>> The tricky part is that the traffic actually has to stress the CPU,
>> which means that the offered load has to be higher than what the CPU can
>> handle. Which generally means running on high-speed NICs with small
>> packets: a modern server CPU has no problem keeping up with a 10G link
>> even at 64-byte packet size, so a 100G NIC is needed, or the test needs
>> to be run on a low-powered machine.
>
> I have a 10G box. I can tell cpufreq to go down to 1.1 GHz and I could
> reduce the queues to one and hope that it is slow enough.
Yeah, that may work. As long as the baseline performance is below the
~14Mpps that's 10G line rate for small packets.
>> As a traffic generator, the xdp-trafficgen utility also in xdp-tools can
>> be used, or the in-kernel pktgen, or something like T-rex or Moongen.
>> Generally serving UDP traffic with 64-byte packets on a single port
>> is enough to make sure the traffic is serviced by a single CPU, although
>> some configuration may be needed to steer IRQs as well.
>
> I played with xdp-trafficgen:
> | # xdp-trafficgen udp eno2 -v
> | Current rlimit is infinity or 0. Not raising
> | Kernel supports 5-arg xdp_cpumap_kthread tracepoint
> | Error in ethtool ioctl: Operation not supported
> | Got -95 queues for ifname lo
> | Kernel supports 5-arg xdp_cpumap_kthread tracepoint
> | Got 94 queues for ifname eno2
> | Transmitting on eno2 (ifindex 3)
> | lo->eno2 0 err/s 0 xmit/s
> | lo->eno2 0 err/s 0 xmit/s
> | lo->eno2 0 err/s 0 xmit/s
>
> I even tried setting the MAC address with -M/-m but nothing happened. I
> do see something happening on the drop side when I issue a ping command.
> Does that ring a bell? Otherwise I'll try pktgen. This is a Debian
> kernel (just to ensure I didn't break something or forget a config
> switch).
Hmm, how old a kernel? And on what hardware? xdp-trafficgen requires a
relatively new kernel, and the driver needs to support XDP_REDIRECT. It
may be simpler to use pktgen, and at 10G rates that shouldn't become a
bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
samples/pktgen in the kernel source tree is fine for this usage.
-Toke
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-19 19:01 ` Toke Høiland-Jørgensen
@ 2024-02-20 9:17 ` Jesper Dangaard Brouer
2024-02-20 10:17 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-20 9:17 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Sebastian Andrzej Siewior; +Cc: bpf, netdev
On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote:
> may be simpler to use pktgen, and at 10G rates that shouldn't become a
> bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
> samples/pktgen in the kernel source tree is fine for this usage.
Example of running the script:
./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 \
    -m ec:0d:9a:db:11:c4 -t 12
Notice the last parameter, which is the number of threads to start.
If you have an ixgbe NIC driver, then I recommend -t 2 even if you have
more CPUs.
--Jesper
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 9:17 ` Jesper Dangaard Brouer
@ 2024-02-20 10:17 ` Sebastian Andrzej Siewior
2024-02-20 10:42 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-20 10:17 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 2024-02-20 10:17:53 [+0100], Jesper Dangaard Brouer wrote:
>
>
> On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote:
> > may be simpler to use pktgen, and at 10G rates that shouldn't become a
> > bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
> > samples/pktgen in the kernel source tree is fine for this usage.
>
> Example of running the script:
> ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 \
>     -m ec:0d:9a:db:11:c4 -t 12
>
> Notice the last parameter, which is the number of threads to start.
> If you have an ixgbe NIC driver, then I recommend -t 2 even if you have
> more CPUs.
I get
| Summary 8,435,690 rx/s 0 err/s
| Summary 8,436,294 rx/s 0 err/s
with "-t 8 -b 64". I started with 2 and then increased until rx/s was
falling again. I have ixgbe on the sending side and i40e on the
receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
| # xdp-bench drop eth1
| Failed to attach XDP program: Cannot allocate memory
This is v6.8-rc5 on both sides. Let me see where this is coming from…
> --Jesper
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 10:17 ` Sebastian Andrzej Siewior
@ 2024-02-20 10:42 ` Jesper Dangaard Brouer
2024-02-20 12:08 ` Sebastian Andrzej Siewior
2024-02-20 12:10 ` Dave Taht
0 siblings, 2 replies; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-20 10:42 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 20/02/2024 11.17, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 10:17:53 [+0100], Jesper Dangaard Brouer wrote:
>>
>>
>> On 19/02/2024 20.01, Toke Høiland-Jørgensen wrote:
>>> may be simpler to use pktgen, and at 10G rates that shouldn't become a
>>> bottleneck either. The pktgen_sample03_burst_single_flow.sh script in
>>> samples/pktgen in the kernel source tree is fine for this usage.
>>
>> Example of running the script:
>> ./pktgen_sample03_burst_single_flow.sh -vi mlx5p1 -d 198.18.1.1 \
>>     -m ec:0d:9a:db:11:c4 -t 12
>>
>> Notice the last parameter, which is the number of threads to start.
>> If you have an ixgbe NIC driver, then I recommend -t 2 even if you have
>> more CPUs.
>
> I get
> | Summary 8,435,690 rx/s 0 err/s
This seems low...
Have you remembered to disable Ethernet flow-control?
# ethtool -A ixgbe1 rx off tx off
# ethtool -A i40e2 rx off tx off
> | Summary 8,436,294 rx/s 0 err/s
You want to see the "extended" info via cmdline (or Ctrl+\)
# xdp-bench drop -e eth1
>
> with "-t 8 -b 64". I started with 2 and then increased until rx/s was
> falling again. I have ixgbe on the sending side and i40e on the
With ixgbe on the sending side, my testlab shows I need -t 2.
With -t 2 :
Summary 14,678,170 rx/s 0 err/s
receive total 14,678,170 pkt/s 14,678,170 drop/s
0 error/s
cpu:1 14,678,170 pkt/s 14,678,170 drop/s
0 error/s
xdp_exception 0 hit/s
with -t 4:
Summary 10,255,385 rx/s 0 err/s
receive total 10,255,385 pkt/s 10,255,385 drop/s
0 error/s
cpu:1 10,255,385 pkt/s 10,255,385 drop/s
0 error/s
xdp_exception 0 hit/s
> receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
> | # xdp-bench drop eth1
> | Failed to attach XDP program: Cannot allocate memory
>
> This is v6.8-rc5 on both sides. Let me see where this is coming from…
>
Another pitfall with ixgbe is that it does a full link reset when
adding/removing XDP prog on device. This can be annoying if connected
back-to-back, because "remote" pktgen will stop on link reset.
--Jesper
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 10:42 ` Jesper Dangaard Brouer
@ 2024-02-20 12:08 ` Sebastian Andrzej Siewior
2024-02-20 12:57 ` Jesper Dangaard Brouer
2024-02-20 12:10 ` Dave Taht
1 sibling, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-20 12:08 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 2024-02-20 11:42:57 [+0100], Jesper Dangaard Brouer wrote:
> This seems low...
> Have you remembered to disable Ethernet flow-control?
No but one side says:
| i40e 0000:3d:00.1 eno2np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
but I did this
> # ethtool -A ixgbe1 rx off tx off
> # ethtool -A i40e2 rx off tx off
and it didn't change much.
>
> > | Summary 8,436,294 rx/s 0 err/s
>
> You want to see the "extended" info via cmdline (or Ctrl+\)
>
> # xdp-bench drop -e eth1
>
>
> >
> > with "-t 8 -b 64". I started with 2 and then increased until rx/s was
> > falling again. I have ixgbe on the sending side and i40e on the
>
> With ixgbe on the sending side, my testlab shows I need -t 2.
>
> With -t 2 :
> Summary 14,678,170 rx/s 0 err/s
> receive total 14,678,170 pkt/s 14,678,170 drop/s 0
> error/s
> cpu:1 14,678,170 pkt/s 14,678,170 drop/s 0
> error/s
> xdp_exception 0 hit/s
>
> with -t 4:
>
> Summary 10,255,385 rx/s 0 err/s
> receive total 10,255,385 pkt/s 10,255,385 drop/s 0
> error/s
> cpu:1 10,255,385 pkt/s 10,255,385 drop/s 0
> error/s
> xdp_exception 0 hit/s
>
> > receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
> > | # xdp-bench drop eth1
> > | Failed to attach XDP program: Cannot allocate memory
> >
> > This is v6.8-rc5 on both sides. Let me see where this is coming from…
> >
>
> Another pitfall with ixgbe is that it does a full link reset when
> adding/removing XDP prog on device. This can be annoying if connected
> back-to-back, because "remote" pktgen will stop on link reset.
So I replaced nr_cpu_ids with 64 and booted with maxcpus=64 so that I can
run xdp-bench on the ixgbe.
So: i40e sends, ixgbe receives.
-t 2
| Summary 2,348,800 rx/s 0 err/s
| receive total 2,348,800 pkt/s 2,348,800 drop/s 0 error/s
| cpu:0 2,348,800 pkt/s 2,348,800 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 4
| Summary 4,158,199 rx/s 0 err/s
| receive total 4,158,199 pkt/s 4,158,199 drop/s 0 error/s
| cpu:0 4,158,199 pkt/s 4,158,199 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 8
| Summary 5,612,861 rx/s 0 err/s
| receive total 5,612,861 pkt/s 5,612,861 drop/s 0 error/s
| cpu:0 5,612,861 pkt/s 5,612,861 drop/s 0 error/s
| xdp_exception 0 hit/s
Going higher makes the rate drop. With 8 it floats between 5.5 and 5.7 Mpps.
Doing "ethtool -G eno2np1 tx 4096 rx 4096" on the i40e makes it worse;
the default 512/512 gives the numbers from above, and going below 256
makes it worse as well.
Receiving on i40e, sending on ixgbe:
-t 2
|Summary 3,042,957 rx/s 0 err/s
| receive total 3,042,957 pkt/s 3,042,957 drop/s 0 error/s
| cpu:60 3,042,957 pkt/s 3,042,957 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 4
|Summary 5,442,166 rx/s 0 err/s
| receive total 5,442,166 pkt/s 5,442,166 drop/s 0 error/s
| cpu:60 5,442,166 pkt/s 5,442,166 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 6
| Summary 7,023,406 rx/s 0 err/s
| receive total 7,023,406 pkt/s 7,023,406 drop/s 0 error/s
| cpu:60 7,023,406 pkt/s 7,023,406 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 8
| Summary 7,540,915 rx/s 0 err/s
| receive total 7,540,915 pkt/s 7,540,915 drop/s 0 error/s
| cpu:60 7,540,915 pkt/s 7,540,915 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 10
|Summary 7,699,143 rx/s 0 err/s
| receive total 7,699,143 pkt/s 7,699,143 drop/s 0 error/s
| cpu:60 7,699,143 pkt/s 7,699,143 drop/s 0 error/s
| xdp_exception 0 hit/s
-t 18
| Summary 7,784,946 rx/s 0 err/s
| receive total 7,784,946 pkt/s 7,784,946 drop/s 0 error/s
| cpu:60 7,784,946 pkt/s 7,784,946 drop/s 0 error/s
| xdp_exception 0 hit/s
After -t 18 it drops down to about 2 Mpps.
Now I am worse off than before, since -t 8 says 7.5 Mpps while it did 8.4
Mpps in the morning. Do you maybe have a .config for me, in case I did
not enable some performance switch?
> --Jesper
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 10:42 ` Jesper Dangaard Brouer
2024-02-20 12:08 ` Sebastian Andrzej Siewior
@ 2024-02-20 12:10 ` Dave Taht
1 sibling, 0 replies; 25+ messages in thread
From: Dave Taht @ 2024-02-20 12:10 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Sebastian Andrzej Siewior, Toke Høiland-Jørgensen, bpf,
netdev
I am actually rather interested in what granularity ethernet flow
control takes place at these days, at these speeds. Anyone know?
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 12:08 ` Sebastian Andrzej Siewior
@ 2024-02-20 12:57 ` Jesper Dangaard Brouer
2024-02-20 15:32 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-20 12:57 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 20/02/2024 13.08, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 11:42:57 [+0100], Jesper Dangaard Brouer wrote:
>> This seems low...
>> Have you remembered to disable Ethernet flow-control?
>
> No but one side says:
> | i40e 0000:3d:00.1 eno2np1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
>
> but I did this
>
>> # ethtool -A ixgbe1 rx off tx off
>> # ethtool -A i40e2 rx off tx off
>
> and it didn't change much.
>
>>
>>> | Summary 8,436,294 rx/s 0 err/s
>>
>> You want to see the "extended" info via cmdline (or Ctrl+\)
>>
>> # xdp-bench drop -e eth1
>>
>>
>>>
>>> with "-t 8 -b 64". I started with 2 and then increased until rx/s was
>>> falling again. I have ixgbe on the sending side and i40e on the
>>
>> With ixgbe on the sending side, my testlab shows I need -t 2.
>>
>> With -t 2 :
>> Summary 14,678,170 rx/s 0 err/s
>> receive total 14,678,170 pkt/s 14,678,170 drop/s 0
>> error/s
>> cpu:1 14,678,170 pkt/s 14,678,170 drop/s 0
>> error/s
>> xdp_exception 0 hit/s
>>
>> with -t 4:
>>
>> Summary 10,255,385 rx/s 0 err/s
>> receive total 10,255,385 pkt/s 10,255,385 drop/s 0
>> error/s
>> cpu:1 10,255,385 pkt/s 10,255,385 drop/s 0
>> error/s
>> xdp_exception 0 hit/s
>>
>>> receiving side. I tried to receive on ixgbe but this ended with -ENOMEM
>>> | # xdp-bench drop eth1
>>> | Failed to attach XDP program: Cannot allocate memory
>>>
>>> This is v6.8-rc5 on both sides. Let me see where this is coming from…
>>>
>>
>> Another pitfall with ixgbe is that it does a full link reset when
>> adding/removing XDP prog on device. This can be annoying if connected
>> back-to-back, because "remote" pktgen will stop on link reset.
>
> So I replaced nr_cpu_ids with 64 and booted with maxcpus=64 so that I can
> run xdp-bench on the ixgbe.
>
Yes, ixgbe HW has limited TX queues, and XDP tries to allocate a
hardware TX queue for every CPU in the system. So, I guess you have too
many CPUs in your system - lol.
Other drivers have a fallback to a locked XDP TX path, so this is also
something to look out for on the machine with i40e.
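Conceptually the fallback looks something like this (an illustrative
sketch, not the actual driver code; the names are made up):

/* With more CPUs than HW TX queues, queues are shared between CPUs and
 * the shared ones must be locked; otherwise each CPU owns its queue and
 * can transmit lock-free. */
static void xdp_xmit_on_queue(struct my_tx_ring *ring, bool shared)
{
        if (shared)
                spin_lock(&ring->tx_lock);
        /* ... place the XDP frame's descriptor on the ring ... */
        if (shared)
                spin_unlock(&ring->tx_lock);
}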
> So: i40e sends, ixgbe receives.
>
> -t 2
>
> | Summary 2,348,800 rx/s 0 err/s
> | receive total 2,348,800 pkt/s 2,348,800 drop/s 0 error/s
> | cpu:0 2,348,800 pkt/s 2,348,800 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
This is way too low, with i40e sending.
On my system with only -t 1 my i40e driver can send with approx 15Mpps:
Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec
Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec
> -t 4
> | Summary 4,158,199 rx/s 0 err/s
> | receive total 4,158,199 pkt/s 4,158,199 drop/s 0 error/s
> | cpu:0 4,158,199 pkt/s 4,158,199 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
Do notice that this is all hitting CPU:0
(this is by design from pktgen_sample03_burst_single_flow.sh)
> -t 8
> | Summary 5,612,861 rx/s 0 err/s
> | receive total 5,612,861 pkt/s 5,612,861 drop/s 0 error/s
> | cpu:0 5,612,861 pkt/s 5,612,861 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
> Going higher makes the rate drop. With 8 it floats between 5.5 and 5.7 Mpps.
>
At -t 8 we seem to have hit a limit on the RX side, which also seems too low.
I recommend checking what speed the packet generator is sending at.
I use this tool: ethtool_stats.pl
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
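A typical invocation looks something like this (flags from memory, check
the script's --help):

# ./ethtool_stats.pl --dev eth1 --sec 2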
What we are basically asking: make sure the packet generator is not the
limiting factor in your tests, as we need to make sure the DUT (Device
Under Test) is being overloaded 100%. I often also check (via perf
record) that the DUT doesn't have idle CPU cycles (yes, this can easily
happen... and happens when we hit a limit in hardware, either the NIC or
the PCIe slot).
> Doing "ethtool -G eno2np1 tx 4096 rx 4096" on the i40e makes it worse;
> the default 512/512 gives the numbers from above, and going below 256
> makes it worse as well.
>
> Receiving on i40e, sending on ixgbe:
>
> -t 2
> |Summary 3,042,957 rx/s 0 err/s
> | receive total 3,042,957 pkt/s 3,042,957 drop/s 0 error/s
> | cpu:60 3,042,957 pkt/s 3,042,957 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
> -t 4
> |Summary 5,442,166 rx/s 0 err/s
> | receive total 5,442,166 pkt/s 5,442,166 drop/s 0 error/s
> | cpu:60 5,442,166 pkt/s 5,442,166 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
>
> -t 6
> | Summary 7,023,406 rx/s 0 err/s
> | receive total 7,023,406 pkt/s 7,023,406 drop/s 0 error/s
> | cpu:60 7,023,406 pkt/s 7,023,406 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
>
> -t 8
> | Summary 7,540,915 rx/s 0 err/s
> | receive total 7,540,915 pkt/s 7,540,915 drop/s 0 error/s
> | cpu:60 7,540,915 pkt/s 7,540,915 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
> -t 10
> |Summary 7,699,143 rx/s 0 err/s
> | receive total 7,699,143 pkt/s 7,699,143 drop/s 0 error/s
> | cpu:60 7,699,143 pkt/s 7,699,143 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
At this level, if you can verify that CPU:60 is 100% loaded, and the
packet generator is sending more than the rx number, then it could work
as a valid experiment.
> -t 18
> | Summary 7,784,946 rx/s 0 err/s
> | receive total 7,784,946 pkt/s 7,784,946 drop/s 0 error/s
> | cpu:60 7,784,946 pkt/s 7,784,946 drop/s 0 error/s
> | xdp_exception 0 hit/s
>
> After -t 18 it drops down to about 2 Mpps.
> Now I am worse off than before, since -t 8 says 7.5 Mpps while it did 8.4
> Mpps in the morning. Do you maybe have a .config for me, in case I did
> not enable some performance switch?
>
I would look for root-cause with perf record +
perf report --sort cpu,comm,dso,symbol --no-children
--Jesper
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 12:57 ` Jesper Dangaard Brouer
@ 2024-02-20 15:32 ` Sebastian Andrzej Siewior
2024-02-22 9:22 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-20 15:32 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 2024-02-20 13:57:24 [+0100], Jesper Dangaard Brouer wrote:
> > So I replaced nr_cpu_ids with 64 and booted with maxcpus=64 so that I can
> > run xdp-bench on the ixgbe.
> >
>
> Yes, ixgbe HW has limited TX queues, and XDP tries to allocate a
> hardware TX queue for every CPU in the system. So, I guess you have too
> many CPUs in your system - lol.
>
> Other drivers have a fallback to a locked XDP TX path, so this is also
> something to look out for on the machine with i40e.
This locked XDP TX path starts at 64 CPUs, but XDP progs are rejected with
more than 64 * 2 CPUs.
> > So: i40e sends, ixgbe receives.
> >
> > -t 2
> >
> > | Summary 2,348,800 rx/s 0 err/s
> > | receive total 2,348,800 pkt/s 2,348,800 drop/s 0 error/s
> > | cpu:0 2,348,800 pkt/s 2,348,800 drop/s 0 error/s
> > | xdp_exception 0 hit/s
> >
>
> This is way too low, with i40e sending.
>
> On my system with only -t 1 my i40e driver can send with approx 15Mpps:
>
> Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec
> Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec
-t1 in ixgbe
Show adapter(s) (eth1) statistics (ONLY that changed!)
Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_bytes /sec
Ethtool(eth1 ) stat: 115047684 ( 115,047,684) <= tx_bytes_nic /sec
Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_packets /sec
Ethtool(eth1 ) stat: 1797636 ( 1,797,636) <= tx_pkts_nic /sec
Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_queue_0_bytes /sec
Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_queue_0_packets /sec
-t1 in i40e
Ethtool(eno2np1 ) stat: 90 ( 90) <= port.rx_bytes /sec
Ethtool(eno2np1 ) stat: 1 ( 1) <= port.rx_size_127 /sec
Ethtool(eno2np1 ) stat: 1 ( 1) <= port.rx_unicast /sec
Ethtool(eno2np1 ) stat: 79554379 ( 79,554,379) <= port.tx_bytes /sec
Ethtool(eno2np1 ) stat: 1243037 ( 1,243,037) <= port.tx_size_64 /sec
Ethtool(eno2np1 ) stat: 1243037 ( 1,243,037) <= port.tx_unicast /sec
Ethtool(eno2np1 ) stat: 86 ( 86) <= rx-32.bytes /sec
Ethtool(eno2np1 ) stat: 1 ( 1) <= rx-32.packets /sec
Ethtool(eno2np1 ) stat: 86 ( 86) <= rx_bytes /sec
Ethtool(eno2np1 ) stat: 1 ( 1) <= rx_cache_waive /sec
Ethtool(eno2np1 ) stat: 1 ( 1) <= rx_packets /sec
Ethtool(eno2np1 ) stat: 1 ( 1) <= rx_unicast /sec
Ethtool(eno2np1 ) stat: 74580821 ( 74,580,821) <= tx-0.bytes /sec
Ethtool(eno2np1 ) stat: 1243014 ( 1,243,014) <= tx-0.packets /sec
Ethtool(eno2np1 ) stat: 74580821 ( 74,580,821) <= tx_bytes /sec
Ethtool(eno2np1 ) stat: 1243014 ( 1,243,014) <= tx_packets /sec
Ethtool(eno2np1 ) stat: 1243037 ( 1,243,037) <= tx_unicast /sec
mine is slightly slower. But this seems to match what I see on the RX
side.
> At this level, if you can verify that CPU:60 is 100% loaded, and packet
> generator is sending more than rx number, then it could work as a valid
> experiment.
i40e receiving on CPU 8:
%Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 84.8 id, 0.0 wa, 0.0 hi, 15.2 si, 0.0 st
ixgbe receiving on CPU 13:
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni, 56.7 id, 0.0 wa, 0.0 hi, 43.3 si, 0.0 st
Looks idle. On the sending side kpktgend_0 is always at 100%.
> > -t 18
> > | Summary 7,784,946 rx/s 0 err/s
> > | receive total 7,784,946 pkt/s 7,784,946 drop/s 0 error/s
> > | cpu:60 7,784,946 pkt/s 7,784,946 drop/s 0 error/s
> > | xdp_exception 0 hit/s
> >
> > after t18 it drop down to 2,…
> > Now I got worse than before since -t8 says 7,5… and it did 8,4 in the
> > morning. Do you have maybe a .config for me in case I did not enable the
> > performance switch?
> >
>
> I would look for root-cause with perf record +
> perf report --sort cpu,comm,dso,symbol --no-children
While sending with ixgbe, perf top on the box shows:
| Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
| Overhead CPU Command Shared Object Symbol
| 31.98% 000 kpktgend_0 [kernel] [k] xas_find
| 6.72% 000 kpktgend_0 [kernel] [k] pfn_to_dma_pte
| 5.63% 000 kpktgend_0 [kernel] [k] ixgbe_xmit_frame_ring
| 4.78% 000 kpktgend_0 [kernel] [k] dma_pte_clear_level
| 3.16% 000 kpktgend_0 [kernel] [k] __iommu_dma_unmap
| 2.30% 000 kpktgend_0 [kernel] [k] fq_ring_free_locked
| 1.99% 000 kpktgend_0 [kernel] [k] __domain_mapping
| 1.82% 000 kpktgend_0 [kernel] [k] iommu_dma_alloc_iova
| 1.80% 000 kpktgend_0 [kernel] [k] __iommu_map
| 1.72% 000 kpktgend_0 [kernel] [k] iommu_pgsize.isra.0
| 1.70% 000 kpktgend_0 [kernel] [k] __iommu_dma_map
| 1.63% 000 kpktgend_0 [kernel] [k] alloc_iova_fast
| 1.59% 000 kpktgend_0 [kernel] [k] _raw_spin_lock_irqsave
| 1.32% 000 kpktgend_0 [kernel] [k] iommu_map
| 1.30% 000 kpktgend_0 [kernel] [k] iommu_dma_map_page
| 1.23% 000 kpktgend_0 [kernel] [k] intel_iommu_iotlb_sync_map
| 1.21% 000 kpktgend_0 [kernel] [k] xa_find_after
| 1.17% 000 kpktgend_0 [kernel] [k] ixgbe_poll
| 1.06% 000 kpktgend_0 [kernel] [k] __iommu_unmap
| 1.04% 000 kpktgend_0 [kernel] [k] intel_iommu_unmap_pages
| 1.01% 000 kpktgend_0 [kernel] [k] free_iova_fast
| 0.96% 000 kpktgend_0 [pktgen] [k] pktgen_thread_worker
the i40e box while sending:
|Samples: 400K of event 'cycles:P', 4000 Hz, Event count (approx.): 80512443924 lost: 0/0 drop: 0/0
|Overhead CPU Command Shared Object Symbol
| 24.04% 000 kpktgend_0 [kernel] [k] i40e_lan_xmit_frame
| 17.20% 019 swapper [kernel] [k] i40e_napi_poll
| 4.84% 019 swapper [kernel] [k] intel_idle_irq
| 4.20% 019 swapper [kernel] [k] napi_consume_skb
| 3.00% 000 kpktgend_0 [pktgen] [k] pktgen_thread_worker
| 2.76% 008 swapper [kernel] [k] i40e_napi_poll
| 2.36% 000 kpktgend_0 [kernel] [k] dma_map_page_attrs
| 1.93% 019 swapper [kernel] [k] dma_unmap_page_attrs
| 1.70% 008 swapper [kernel] [k] intel_idle_irq
| 1.44% 008 swapper [kernel] [k] __udp4_lib_rcv
| 1.44% 008 swapper [kernel] [k] __netif_receive_skb_core.constprop.0
| 1.40% 008 swapper [kernel] [k] napi_build_skb
| 1.28% 000 kpktgend_0 [kernel] [k] kfree_skb_reason
| 1.27% 008 swapper [kernel] [k] ip_rcv_core
| 1.19% 008 swapper [kernel] [k] inet_gro_receive
| 1.01% 008 swapper [kernel] [k] kmem_cache_free.part.0
> --Jesper
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-20 15:32 ` Sebastian Andrzej Siewior
@ 2024-02-22 9:22 ` Sebastian Andrzej Siewior
2024-02-22 10:10 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-22 9:22 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 2024-02-20 16:32:08 [+0100], To Jesper Dangaard Brouer wrote:
> >
> > Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec
> > Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec
>
> -t1 in ixgbe
> Show adapter(s) (eth1) statistics (ONLY that changed!)
> Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_bytes /sec
> Ethtool(eth1 ) stat: 115047684 ( 115,047,684) <= tx_bytes_nic /sec
> Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_packets /sec
> Ethtool(eth1 ) stat: 1797636 ( 1,797,636) <= tx_pkts_nic /sec
> Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_queue_0_bytes /sec
> Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_queue_0_packets /sec
…
> While sending with ixgbe, perf top on the box shows:
> | Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
> | Overhead CPU Command Shared Object Symbol
> | 31.98% 000 kpktgend_0 [kernel] [k] xas_find
> | 6.72% 000 kpktgend_0 [kernel] [k] pfn_to_dma_pte
> | 5.63% 000 kpktgend_0 [kernel] [k] ixgbe_xmit_frame_ring
> | 4.78% 000 kpktgend_0 [kernel] [k] dma_pte_clear_level
> | 3.16% 000 kpktgend_0 [kernel] [k] __iommu_dma_unmap
I disabled the IOMMU and now I get:
Ethtool(eth1 ) stat: 14158562 ( 14,158,562) <= tx_packets /sec
Ethtool(eth1 ) stat: 14158685 ( 14,158,685) <= tx_pkts_nic /sec
looks like a small improvement… It is not your 15 but close. -t2 does
improve the situation. There is a warning from DMA mapping code but ;)
> > --Jesper
Sebastian
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-22 9:22 ` Sebastian Andrzej Siewior
@ 2024-02-22 10:10 ` Jesper Dangaard Brouer
2024-02-22 10:58 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-22 10:10 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 22/02/2024 10.22, Sebastian Andrzej Siewior wrote:
> On 2024-02-20 16:32:08 [+0100], To Jesper Dangaard Brouer wrote:
>>>
>>> Ethtool(i40e2) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec
>>> Ethtool(i40e2) stat: 15028589 ( 15,028,589) <= tx_packets /sec
>>
>> -t1 in ixgbe
>> Show adapter(s) (eth1) statistics (ONLY that changed!)
>> Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_bytes /sec
>> Ethtool(eth1 ) stat: 115047684 ( 115,047,684) <= tx_bytes_nic /sec
>> Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_packets /sec
>> Ethtool(eth1 ) stat: 1797636 ( 1,797,636) <= tx_pkts_nic /sec
>> Ethtool(eth1 ) stat: 107857263 ( 107,857,263) <= tx_queue_0_bytes /sec
>> Ethtool(eth1 ) stat: 1797621 ( 1,797,621) <= tx_queue_0_packets /sec
> …
>> While sending with ixgbe, perf top on the box shows:
>> | Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
>> | Overhead CPU Command Shared Object Symbol
>> | 31.98% 000 kpktgend_0 [kernel] [k] xas_find
>> | 6.72% 000 kpktgend_0 [kernel] [k] pfn_to_dma_pte
>> | 5.63% 000 kpktgend_0 [kernel] [k] ixgbe_xmit_frame_ring
>> | 4.78% 000 kpktgend_0 [kernel] [k] dma_pte_clear_level
>> | 3.16% 000 kpktgend_0 [kernel] [k] __iommu_dma_unmap
>
> I disabled the IOMMU and now I get:
Yes, it is clearly the IOMMU code that causes the performance issue for you.
This driver doesn't use page_pool, so I want to point out (for people
finding this post in the future) that page_pool keeps DMA mappings for
recycled frames, which should address the IOMMU overhead issue here.
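For illustration, a driver opting in sets up its pool roughly like this
(a minimal sketch of the page_pool API; the pool size and device pointer
are placeholders):

        struct page_pool_params pp_params = {
                .flags          = PP_FLAG_DMA_MAP, /* pool keeps the DMA mapping */
                .order          = 0,
                .pool_size      = 256,
                .nid            = NUMA_NO_NODE,
                .dev            = rx_ring->dev,    /* device doing the DMA */
                .dma_dir        = DMA_FROM_DEVICE,
        };
        struct page_pool *pool = page_pool_create(&pp_params);

Pages handed out by the pool come back already mapped, so recycled pages
skip the per-packet unmap/remap and with it the IOMMU overhead.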
>
> Ethtool(eth1 ) stat: 14158562 ( 14,158,562) <= tx_packets /sec
> Ethtool(eth1 ) stat: 14158685 ( 14,158,685) <= tx_pkts_nic /sec
>
> looks like a small improvement… It is not your 15 but close. -t2 does
> improve the situation.
You cannot reach 15 Mpps on 10 Gbit/s, as wirespeed for 10G is 14.88 Mpps.
Congratulations, I think this 14.15 Mpps is as close to wirespeed as is
possible on your hardware.
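(For reference: a 64-byte frame occupies 64 + 8 (preamble) + 12
(inter-frame gap) = 84 bytes on the wire, so 10^10 bit/s / (84 * 8 bit)
= 14,880,952 pkt/s, which is where the 14.88 Mpps figure comes from.)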
BTW what CPU are you using?
> There is a warning from DMA mapping code but ;)
It is a warning from IOMMU code?
It usually means there is a real DMA unmap bug (which we should fix).
--Jesper
* Re: [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
2024-02-22 10:10 ` Jesper Dangaard Brouer
@ 2024-02-22 10:58 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-02-22 10:58 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: Toke Høiland-Jørgensen, bpf, netdev
On 2024-02-22 11:10:44 [+0100], Jesper Dangaard Brouer wrote:
> > Ethtool(eth1 ) stat: 14158562 ( 14,158,562) <= tx_packets /sec
> > Ethtool(eth1 ) stat: 14158685 ( 14,158,685) <= tx_pkts_nic /sec
> >
> > looks like a small improvement… It is not your 15 but close. -t2 does
> > improve the situation.
>
> You cannot reach 15 Mpps on 10 Gbit/s, as wirespeed for 10G is 14.88 Mpps.
Oh, my bad.
> Congratulations, I think this 14.15 Mpps is as close to wirespeed as is
> possible on your hardware.
>
> BTW what CPU are you using?
"Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz"
The "performance" governor is used, I lowered the number of CPUs and
disabled SMT.
> > There is a warning from DMA mapping code but ;)
>
> It is a warning from IOMMU code?
> It usually means there is a real DMA unmap bug (which we should fix).
Not sure, I don't think so:
| ------------[ cut here ]------------
| ehci-pci 0000:00:1a.0: DMA addr 0x0000000105016ce8+8 overflow (mask ffffffff, bus limit 0).
| WARNING: CPU: 0 PID: 1029 at kernel/dma/direct.h:105 dma_map_page_attrs+0x1e8/0x1f0
| RIP: 0010:dma_map_page_attrs+0x1e8/0x1f0
| Call Trace:
| <TASK>
| usb_hcd_map_urb_for_dma+0x1b0/0x4d0
| usb_hcd_submit_urb+0x342/0x9b0
| usb_start_wait_urb+0x50/0xc0
| usb_control_msg+0xc8/0x110
| get_bMaxPacketSize0+0x5a/0xb0
and USB isn't working. I *think* it is the "memory above 4G" thing, not
sure where it took the wrong turn.
> --Jesper
Sebastian
Thread overview: 25+ messages
2024-02-13 14:58 [PATCH RFC net-next 0/2] Use per-task storage for XDP-redirects on PREEMPT_RT Sebastian Andrzej Siewior
2024-02-13 14:58 ` [PATCH RFC net-next 1/2] net: Reference bpf_redirect_info via task_struct " Sebastian Andrzej Siewior
2024-02-13 20:50 ` Jesper Dangaard Brouer
2024-02-14 12:19 ` Sebastian Andrzej Siewior
2024-02-14 13:23 ` Toke Høiland-Jørgensen
2024-02-14 14:28 ` Sebastian Andrzej Siewior
2024-02-14 16:08 ` Toke Høiland-Jørgensen
2024-02-14 16:36 ` Sebastian Andrzej Siewior
2024-02-15 20:23 ` Toke Høiland-Jørgensen
2024-02-16 16:57 ` Sebastian Andrzej Siewior
2024-02-19 19:01 ` Toke Høiland-Jørgensen
2024-02-20 9:17 ` Jesper Dangaard Brouer
2024-02-20 10:17 ` Sebastian Andrzej Siewior
2024-02-20 10:42 ` Jesper Dangaard Brouer
2024-02-20 12:08 ` Sebastian Andrzej Siewior
2024-02-20 12:57 ` Jesper Dangaard Brouer
2024-02-20 15:32 ` Sebastian Andrzej Siewior
2024-02-22 9:22 ` Sebastian Andrzej Siewior
2024-02-22 10:10 ` Jesper Dangaard Brouer
2024-02-22 10:58 ` Sebastian Andrzej Siewior
2024-02-20 12:10 ` Dave Taht
2024-02-14 16:13 ` Toke Høiland-Jørgensen
2024-02-15 9:04 ` Sebastian Andrzej Siewior
2024-02-15 12:11 ` Toke Høiland-Jørgensen
2024-02-13 14:58 ` [PATCH RFC net-next 2/2] net: Move per-CPU flush-lists to bpf_xdp_storage " Sebastian Andrzej Siewior