From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: paulmck@kernel.org, Eric Dumazet <edumazet@google.com>
Cc: Yan Zhai <yan@cloudflare.com>,
netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Jiri Pirko <jiri@resnulli.us>, Simon Horman <horms@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Lorenzo Bianconi <lorenzo@kernel.org>,
Coco Li <lixiaoyan@google.com>, Wei Wang <weiwan@google.com>,
Alexander Duyck <alexanderduyck@fb.com>,
Hannes Frederic Sowa <hannes@stressinduktion.org>,
linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
bpf@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [PATCH] net: raise RCU qs after each threaded NAPI poll
Date: Wed, 28 Feb 2024 12:50:53 +0100 [thread overview]
Message-ID: <87edcwerj6.fsf@toke.dk> (raw)
In-Reply-To: <d633c5b9-53a5-4cd6-9dbb-6623bb74c00b@paulmck-laptop>
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
>> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
>> >
>> > We noticed task RCUs being blocked when threaded NAPIs are very busy in
>> > production: detaching any BPF tracing programs, i.e. removing a ftrace
>> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
>> > ranges from hundreds of seconds to even an hour, severely harming any
>> > observability tools that rely on BPF tracing programs. It can be
>> > easily reproduced locally with following setup:
>> >
>> > ip netns add test1
>> > ip netns add test2
>> >
>> > ip -n test1 link add veth1 type veth peer name veth2 netns test2
>> >
>> > ip -n test1 link set veth1 up
>> > ip -n test1 link set lo up
>> > ip -n test2 link set veth2 up
>> > ip -n test2 link set lo up
>> >
>> > ip -n test1 addr add 192.168.1.2/31 dev veth1
>> > ip -n test1 addr add 1.1.1.1/32 dev lo
>> > ip -n test2 addr add 192.168.1.3/31 dev veth2
>> > ip -n test2 addr add 2.2.2.2/31 dev lo
>> >
>> > ip -n test1 route add default via 192.168.1.3
>> > ip -n test2 route add default via 192.168.1.2
>> >
>> > for i in `seq 10 210`; do
>> > for j in `seq 10 210`; do
>> > ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>> > done
>> > done
>> >
>> > ip netns exec test2 ethtool -K veth2 gro on
>> > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
>> > ip netns exec test1 ethtool -K veth1 tso off
>> >
>> > Then run an iperf3 client/server and a bpftrace script can trigger it:
>> >
>> > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
>> > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
>> > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>> >
>> > Above reproduce for net-next kernel with following RCU and preempt
>> > configuraitons:
>> >
>> > # RCU Subsystem
>> > CONFIG_TREE_RCU=y
>> > CONFIG_PREEMPT_RCU=y
>> > # CONFIG_RCU_EXPERT is not set
>> > CONFIG_SRCU=y
>> > CONFIG_TREE_SRCU=y
>> > CONFIG_TASKS_RCU_GENERIC=y
>> > CONFIG_TASKS_RCU=y
>> > CONFIG_TASKS_RUDE_RCU=y
>> > CONFIG_TASKS_TRACE_RCU=y
>> > CONFIG_RCU_STALL_COMMON=y
>> > CONFIG_RCU_NEED_SEGCBLIST=y
>> > # end of RCU Subsystem
>> > # RCU Debugging
>> > # CONFIG_RCU_SCALE_TEST is not set
>> > # CONFIG_RCU_TORTURE_TEST is not set
>> > # CONFIG_RCU_REF_SCALE_TEST is not set
>> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
>> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
>> > # CONFIG_RCU_TRACE is not set
>> > # CONFIG_RCU_EQS_DEBUG is not set
>> > # end of RCU Debugging
>> >
>> > CONFIG_PREEMPT_BUILD=y
>> > # CONFIG_PREEMPT_NONE is not set
>> > CONFIG_PREEMPT_VOLUNTARY=y
>> > # CONFIG_PREEMPT is not set
>> > CONFIG_PREEMPT_COUNT=y
>> > CONFIG_PREEMPTION=y
>> > CONFIG_PREEMPT_DYNAMIC=y
>> > CONFIG_PREEMPT_RCU=y
>> > CONFIG_HAVE_PREEMPT_DYNAMIC=y
>> > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
>> > CONFIG_PREEMPT_NOTIFIERS=y
>> > # CONFIG_DEBUG_PREEMPT is not set
>> > # CONFIG_PREEMPT_TRACER is not set
>> > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>> >
>> > An interesting observation is that, while tasks RCUs are blocked,
>> > related NAPI thread is still being scheduled (even across cores)
>> > regularly. Looking at the gp conditions, I am inclining to cond_resched
>> > after each __napi_poll being the problem: cond_resched enters the
>> > scheduler with PREEMPT bit, which does not account as a gp for tasks
>> > RCUs. Meanwhile, since the thread has been frequently resched, the
>> > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
>> > seems to have very little chance to kick in. Given the nature of "busy
>> > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
>> > updated (other gp conditions), the result is that such NAPI thread is
>> > put on RCU holdouts list for indefinitely long time.
>> >
>> > This is simply fixed by mirroring the ksoftirqd behavior: after
>> > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
>> > more blocking afterwards for the same setup.
>> >
>> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
>> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
>> > ---
>> > net/core/dev.c | 4 ++++
>> > 1 file changed, 4 insertions(+)
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 275fd5259a4a..6e41263ff5d3 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
>> > net_rps_action_and_irq_enable(sd);
>> > }
>> > skb_defer_free_flush(sd);
>
> Please put a comment here stating that RCU readers cannot cross
> this point.
>
> I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> RCU read-side critical section. And a header comment noting that from
> an RCU perspective, it acts as a momentary enabling of preemption.
OK, so one question here: for XDP, we're basically treating
local_bh_disable/enable() as the RCU critical section, cf the discussion
we had a few years ago that led to this being documented[0]. So why is
it OK to have the rcu_softirq_qs() inside the bh disable/enable pair,
but not inside an rcu_read_lock() section?
Also, looking at the patch in question:
>> > + if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> > + rcu_softirq_qs();
>> > +
>> > local_bh_enable();
Why does that local_bh_enable() not accomplish the same thing as the qs?
-Toke
[0] https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com/
next prev parent reply other threads:[~2024-02-28 11:50 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-27 15:44 [PATCH] net: raise RCU qs after each threaded NAPI poll Yan Zhai
2024-02-27 16:44 ` Eric Dumazet
2024-02-27 18:32 ` Paul E. McKenney
2024-02-27 21:22 ` Yan Zhai
2024-02-27 22:58 ` Paul E. McKenney
2024-02-28 3:10 ` Jakub Kicinski
2024-02-28 4:42 ` Paul E. McKenney
2024-02-28 14:43 ` Jakub Kicinski
2024-02-28 15:15 ` Paul E. McKenney
2024-02-28 15:35 ` Jakub Kicinski
2024-02-28 15:57 ` Yan Zhai
2024-02-28 11:50 ` Toke Høiland-Jørgensen [this message]
2024-02-28 15:10 ` Paul E. McKenney
2024-02-28 15:48 ` Yan Zhai
2024-02-28 17:47 ` Paul E. McKenney
2024-02-28 15:37 ` Joel Fernandes
2024-02-28 16:37 ` Yan Zhai
2024-02-28 17:18 ` Paul E. McKenney
2024-02-28 20:14 ` Joel Fernandes
2024-02-28 21:13 ` Paul E. McKenney
2024-02-28 21:27 ` Joel Fernandes
2024-02-28 21:52 ` Paul E. McKenney
2024-02-28 22:10 ` Joel Fernandes
2024-02-28 22:19 ` Paul E. McKenney
2024-02-28 22:33 ` Steven Rostedt
2024-02-28 22:48 ` Alexei Starovoitov
2024-02-28 22:58 ` Paul E. McKenney
2024-02-29 14:21 ` Joel Fernandes
2024-02-29 16:57 ` Paul E. McKenney
2024-02-29 17:41 ` Joel Fernandes
2024-02-29 18:29 ` Paul E. McKenney
2024-03-02 2:24 ` Joel Fernandes
2024-03-03 0:25 ` Paul E. McKenney
2024-03-03 1:01 ` Joel Fernandes
2024-03-04 9:16 ` Joel Fernandes
2024-03-05 17:53 ` Paul E. McKenney
2024-03-05 19:57 ` Mark Rutland
2024-03-05 21:52 ` Paul E. McKenney
2024-03-06 16:56 ` Steven Rostedt
2024-03-07 16:57 ` Mark Rutland
2024-03-07 18:34 ` Mark Rutland
2024-03-07 18:52 ` Steven Rostedt
2024-03-07 18:58 ` Paul E. McKenney
2024-03-04 9:16 ` Joel Fernandes
2024-02-28 22:49 ` Paul E. McKenney
2024-02-27 21:17 ` Yan Zhai
2024-02-28 23:53 ` Yan Zhai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87edcwerj6.fsf@toke.dk \
--to=toke@redhat.com \
--cc=alexanderduyck@fb.com \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=hannes@stressinduktion.org \
--cc=horms@kernel.org \
--cc=jiri@resnulli.us \
--cc=kernel-team@cloudflare.com \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lixiaoyan@google.com \
--cc=lorenzo@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=paulmck@kernel.org \
--cc=rcu@vger.kernel.org \
--cc=weiwan@google.com \
--cc=yan@cloudflare.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.