netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 net-next 00/15] locking: Introduce nested-BH locking.
@ 2024-05-29 16:02 Sebastian Andrzej Siewior
  2024-05-29 16:02 ` [PATCH v3 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
                   ` (14 more replies)
  0 siblings, 15 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-29 16:02 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon

Disabling bottoms halves acts as per-CPU BKL. On PREEMPT_RT code within
local_bh_disable() section remains preemtible. As a result high prior
tasks (or threaded interrupts) will be blocked by lower-prio task (or
threaded interrupts) which are long running which includes softirq
sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking which is largest stake holder here. Once this done the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

Performance testing. Baseline is net-next as of commit 93bda33046e7a
("Merge branch'net-constify-ctl_table-arguments-of-utility-functions'")
plus v6.10-rc1. A 10GiG link is used between two hosts. The command 
   xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

was invoked on the receiving side with a ixgbe. The sending side uses
pktgen_sample03_burst_single_flow.sh on i40e.

Baseline:
| eth1->?                 9,018,604 rx/s                  0 err,drop/s
|   receive total         9,018,604 pkt/s                 0 drop/s                0 error/s
|     cpu:7               9,018,604 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|     cpu:7->3            9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|   kthread total         9,018,606 pkt/s                 0 drop/s          214,698 sched
|     cpu:3               9,018,606 pkt/s                 0 drop/s          214,698 sched
|     xdp_stats                   0 pass/s        9,018,606 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,018,606 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

perf top --sort cpu,symbol --no-children:
|   18.14%  007  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   13.29%  007  [k] ixgbe_poll
|   12.66%  003  [k] cpu_map_kthread_run
|    7.23%  003  [k] page_frag_free
|    6.76%  007  [k] xdp_do_redirect
|    3.76%  007  [k] cpu_map_redirect
|    3.13%  007  [k] bq_flush_to_queue
|    2.51%  003  [k] xdp_return_frame
|    1.93%  007  [k] try_to_wake_up
|    1.78%  007  [k] _raw_spin_lock
|    1.74%  007  [k] cpu_map_enqueue
|    1.56%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop

With this series applied:
| eth1->?                10,329,340 rx/s                  0 err,drop/s
|   receive total        10,329,340 pkt/s                 0 drop/s                0 error/s
|     cpu:6              10,329,340 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3     10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:6->3           10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total        10,329,321 pkt/s                 0 drop/s           96,297 sched
|     cpu:3              10,329,321 pkt/s                 0 drop/s           96,297 sched
|     xdp_stats                   0 pass/s       10,329,321 drop/s                0 redir/s
|       cpu:3                     0 pass/s       10,329,321 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

perf top --sort cpu,symbol --no-children:
|   20.90%  006  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   12.62%  006  [k] ixgbe_poll
|    9.82%  003  [k] page_frag_free
|    8.73%  003  [k] cpu_map_bpf_prog_run_xdp
|    6.63%  006  [k] xdp_do_redirect
|    4.94%  003  [k] cpu_map_kthread_run
|    4.28%  006  [k] cpu_map_redirect
|    4.03%  006  [k] bq_flush_to_queue
|    3.01%  003  [k] xdp_return_frame
|    1.95%  006  [k] _raw_spin_lock
|    1.94%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop

This diff appears to be noise.

v2…v3 https://lore.kernel.org/all/20240503182957.1042122-1-bigeasy@linutronix.de/:
- WARN checks checks for bpf_net_ctx_get() have been dropped and all
  NULL checks around it. This means bpf_net_ctx_get_ri() assumes the
  context has been set and will segfault if it is not the case.
  Suggested by Alexei and Jesper. This should always work or always
  segfault.

- It has been suggested by Toke to embed struct bpf_net_context into
  task_struct instead just a pointer to it. This would increase the size
  of task_struct by 112 bytes instead just eight and Alexei didn't like
  it due to the size impact with 1m threads. It is a pointer again.

v1…v2 https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/:
- Jakub complained about touching networking drivers to make the
  additional locking work. Alexei complained about the additional
  locking within the XDP/eBFP case.
  This led to a change in how the per-CPU variables are accessed for the
  XDP/eBPF case. On PREEMPT_RT the variables are now stored on stack and
  the task pointer to the structure is saved in the task_struct while
  keeping every for !RT unchanged. This was proposed as a RFC in
  	v1: https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de/

  and then updated

        v2: https://lore.kernel.org/all/20240229183109.646865-1-bigeasy@linutronix.de/
	  - Renamed the container struct from xdp_storage to bpf_net_context.
            Suggested by Toke Høiland-Jørgensen.
	  - Use the container struct also on !PREEMPT_RT builds. Store the
	    pointer to the on-stack struct in a per-CPU variable. Suggested by
            Toke Høiland-Jørgensen.

  This reduces the initial queue from 24 to 15 patches.

- There were complains about the scoped_guard() which shifts the whole
  block and makes it harder to review because the whole gets removed and
  added again. The usage has been replaced with local_lock_nested_bh()+
  its unlock counterpart.

Sebastian


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-06-05 11:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-29 16:02 [PATCH v3 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 02/15] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 04/15] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 08/15] net: softnet_data: Make xmit.recursion per task Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*() Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
2024-05-29 21:09   ` Alexei Starovoitov
2024-05-29 22:09   ` Toke Høiland-Jørgensen
2024-05-31 10:38     ` Sebastian Andrzej Siewior
2024-06-05 10:28       ` Jesper Dangaard Brouer
2024-06-05 10:37         ` Toke Høiland-Jørgensen
2024-06-05 10:41         ` Sebastian Andrzej Siewior
2024-06-05 11:55           ` Jesper Dangaard Brouer
2024-05-29 16:02 ` [PATCH v3 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).