netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 net-next 00/15] locking: Introduce nested-BH locking.
@ 2024-06-18  7:13 Sebastian Andrzej Siewior
  2024-06-18  7:13 ` [PATCH v7 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
                   ` (15 more replies)
  0 siblings, 16 replies; 19+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-18  7:13 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Daniel Bristot de Oliveira, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon

Disabling bottoms halves acts as per-CPU BKL. On PREEMPT_RT code within
local_bh_disable() section remains preemtible. As a result high prior
tasks (or threaded interrupts) will be blocked by lower-prio task (or
threaded interrupts) which are long running which includes softirq
sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking which is largest stake holder here. Once this done the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

Performance testing. Baseline is net-next as of commit 93bda33046e7a
("Merge branch'net-constify-ctl_table-arguments-of-utility-functions'")
plus v6.10-rc1. A 10GiG link is used between two hosts. The command
   xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

was invoked on the receiving side with a ixgbe. The sending side uses
pktgen_sample03_burst_single_flow.sh on i40e.

Baseline:
| eth1->?                 9,018,604 rx/s                  0 err,drop/s
|   receive total         9,018,604 pkt/s                 0 drop/s                0 error/s
|     cpu:7               9,018,604 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|     cpu:7->3            9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|   kthread total         9,018,606 pkt/s                 0 drop/s          214,698 sched
|     cpu:3               9,018,606 pkt/s                 0 drop/s          214,698 sched
|     xdp_stats                   0 pass/s        9,018,606 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,018,606 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

perf top --sort cpu,symbol --no-children:
|   18.14%  007  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   13.29%  007  [k] ixgbe_poll
|   12.66%  003  [k] cpu_map_kthread_run
|    7.23%  003  [k] page_frag_free
|    6.76%  007  [k] xdp_do_redirect
|    3.76%  007  [k] cpu_map_redirect
|    3.13%  007  [k] bq_flush_to_queue
|    2.51%  003  [k] xdp_return_frame
|    1.93%  007  [k] try_to_wake_up
|    1.78%  007  [k] _raw_spin_lock
|    1.74%  007  [k] cpu_map_enqueue
|    1.56%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop

With this series applied:
| eth1->?                10,329,340 rx/s                  0 err,drop/s
|   receive total        10,329,340 pkt/s                 0 drop/s                0 error/s
|     cpu:6              10,329,340 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3     10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:6->3           10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total        10,329,321 pkt/s                 0 drop/s           96,297 sched
|     cpu:3              10,329,321 pkt/s                 0 drop/s           96,297 sched
|     xdp_stats                   0 pass/s       10,329,321 drop/s                0 redir/s
|       cpu:3                     0 pass/s       10,329,321 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

perf top --sort cpu,symbol --no-children:
|   20.90%  006  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   12.62%  006  [k] ixgbe_poll
|    9.82%  003  [k] page_frag_free
|    8.73%  003  [k] cpu_map_bpf_prog_run_xdp
|    6.63%  006  [k] xdp_do_redirect
|    4.94%  003  [k] cpu_map_kthread_run
|    4.28%  006  [k] cpu_map_redirect
|    4.03%  006  [k] bq_flush_to_queue
|    3.01%  003  [k] xdp_return_frame
|    1.95%  006  [k] _raw_spin_lock
|    1.94%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop

This diff appears to be noise.

v6…v7 https://lore.kernel.org/all/20240612170303.3896084-1-bigeasy@linutronix.de/:
- The softnet_data::xmit (08/15) is moved completely for RT and not
  limited to one member. During the review it has been noticed that the
  `more' member requires same protection.
  After some cleanup this might also be considered conditionally for !RT.

v5…v6 https://lore.kernel.org/all/20240607070427.1379327-1-bigeasy@linutronix.de/:
- bpf_redirect_info and the lists (cpu, dev, xsk map_flush_list) is now
  initialized on first usage instead during setup time
  (bpf_net_ctx_set()). bpf_redirect_info::kern_flags is used as
  status to note which member require an init.
  The bpf_redirect_info::kern_flags is moved to the end of the struct
  (after nh) in order to be excluded from the memset() which clears
  everything upto the the nh member.
  This whole lazy init performed to save cycles and not to waste them to
  memset bpf_redirect_info and init the three lists on each
  net_rx_action()/ NAPI invocation even if BPF/redirect is not used.
  Suggested by Jesper Dangaard Brouer.

- Collect Acked-by/Review-by from PeterZ/ tglx

v4…v5 https://lore.kernel.org/all/20240604154425.878636-1-bigeasy@linutronix.de/:
- Remove the guard() notation as well as __free() within the patches.
  Patch #1 and #2 add the guard definition for local_lock_nested_bh()
  but it remains unused with the series.
  The __free() notation for bpf_net_ctx_clear has been removed entirely.

- Collect Toke's Reviewed-by.

v3…v4 https://lore.kernel.org/all/20240529162927.403425-1-bigeasy@linutronix.de/:
- Removed bpf_clear_redirect_map(), moved the comment to the caller.
  Suggested by Toke.

- The bpf_redirect_info structure is memset() each time it is assigned.
  Suggested by Toke.

- The bpf_net_ctx_set() in __napi_busy_loop() has been moved from the
  top of the function to begin/ end of the BH-disabled section. This has
  been done to remain in sync with other call sites.
  After adding the memset() I've been looking at the perf-numbers in my
  test-case and I haven't noticed an impact, the numbers are in the same
  range with and without the change. Therefore I kept the numbers from
  previous posting.

- Collected Alexei's Acked-by.

v2…v3 https://lore.kernel.org/all/20240503182957.1042122-1-bigeasy@linutronix.de/:
- WARN checks checks for bpf_net_ctx_get() have been dropped and all
  NULL checks around it. This means bpf_net_ctx_get_ri() assumes the
  context has been set and will segfault if it is not the case.
  Suggested by Alexei and Jesper. This should always work or always
  segfault.

- It has been suggested by Toke to embed struct bpf_net_context into
  task_struct instead just a pointer to it. This would increase the size
  of task_struct by 112 bytes instead just eight and Alexei didn't like
  it due to the size impact with 1m threads. It is a pointer again.

v1…v2 https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/:
- Jakub complained about touching networking drivers to make the
  additional locking work. Alexei complained about the additional
  locking within the XDP/eBFP case.
  This led to a change in how the per-CPU variables are accessed for the
  XDP/eBPF case. On PREEMPT_RT the variables are now stored on stack and
  the task pointer to the structure is saved in the task_struct while
  keeping every for !RT unchanged. This was proposed as a RFC in
  	v1: https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de/

  and then updated

        v2: https://lore.kernel.org/all/20240229183109.646865-1-bigeasy@linutronix.de/
	  - Renamed the container struct from xdp_storage to bpf_net_context.
            Suggested by Toke Høiland-Jørgensen.
	  - Use the container struct also on !PREEMPT_RT builds. Store the
	    pointer to the on-stack struct in a per-CPU variable. Suggested by
            Toke Høiland-Jørgensen.

  This reduces the initial queue from 24 to 15 patches.

- There were complains about the scoped_guard() which shifts the whole
  block and makes it harder to review because the whole gets removed and
  added again. The usage has been replaced with local_lock_nested_bh()+
  its unlock counterpart.

Sebastian


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-06-19  1:56 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-18  7:13 [PATCH v7 net-next 00/15] locking: Introduce nested-BH locking Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 02/15] locking/local_lock: Add local nested BH locking infrastructure Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 04/15] net: Use nested-BH locking for napi_alloc_cache Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 08/15] net: softnet_data: Make xmit per task Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*() Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
2024-06-18  7:13 ` [PATCH v7 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
2024-06-18  8:14   ` Jesper Dangaard Brouer
2024-06-18  7:13 ` [PATCH v7 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
2024-06-18  8:17   ` Jesper Dangaard Brouer
2024-06-19  1:56 ` [PATCH v7 net-next 00/15] locking: Introduce nested-BH locking Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).