* [PATCH v5 net 0/3] Report RCU QS for busy network kthreads
@ 2024-03-19 20:44 Yan Zhai
2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi,
Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf,
kernel-team, Joel Fernandes, Paul E. McKenney,
Toke Høiland-Jørgensen, Alexei Starovoitov,
Steven Rostedt, mark.rutland, Jesper Dangaard Brouer,
Sebastian Andrzej Siewior
This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically will do:
* polling a batch of packets
* if there is more work, call cond_resched() to allow scheduling
* continue to poll more packets when rx queue is not empty
We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that just calling cond_resched() is insufficient for RCU tasks to reach
quiescent states. It also has the side effect of frequently clearing
the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result,
schedule() will not be called in these circumstances, even though
schedule() in fact provides the required quiescent states. This at least
affects NAPI threads, napi_busy_loop, and the cpumap kthread.
By reporting RCU QSes in these kthreads periodically before calling
cond_resched(), the blocked RCU waiters can make progress correctly.
Instead of reporting a QS for RCU tasks only, this code shares the same
concern noted in commit d28139c4e967 ("rcu: Apply RCU-bh QSes to
RCU-sched and RCU-preempt when safe"), so a consolidated QS is reported
for safety.
It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when the polling interval is set as
high as 2ms, which is far larger than the 50us-100us recommended in the
documentation. So napi_busy_loop is left untouched.
Lastly, this does not affect RT kernels, which do not enter the
scheduler through cond_resched(). Without the side effect mentioned
above, schedule() will be called from time to time, clearing the RCU
Tasks holdouts.
V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/
V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/
V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t
changes since v4:
* polished comments and docs for the RCU helper as Paul McKenney suggested
changes since v3:
* fixed kernel-doc errors
changes since v2:
* created a helper in rcu header to abstract the behavior
* fixed cpumap kthread in addition
changes since v1:
* disable preemption first as Paul McKenney suggested
Yan Zhai (3):
rcu: add a helper to report consolidated flavor QS
net: report RCU QS on threaded NAPI repolling
bpf: report RCU QS in cpumap kthread
include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
kernel/bpf/cpumap.c | 3 +++
net/core/dev.c | 3 +++
3 files changed, 37 insertions(+)
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread* [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai @ 2024-03-19 20:44 ` Yan Zhai 2024-03-19 21:31 ` Paul E. McKenney 2024-03-22 11:24 ` Sebastian Andrzej Siewior 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai ` (3 subsequent siblings) 4 siblings, 2 replies; 17+ messages in thread From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior When under heavy load, network processing can run CPU-bound for many tens of seconds. Even in preemptible kernels (non-RT kernel), this can block RCU Tasks grace periods, which can cause trace-event removal to take more than a minute, which is unacceptably long. This commit therefore creates a new helper function that passes through both RCU and RCU-Tasks quiescent states every 100 milliseconds. This hard-coded value suffices for current workloads. Suggested-by: Paul E. 
McKenney <paulmck@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- v4->v5: adjusted kernel docs and commit message v3->v4: kernel docs error --- include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 16f519914415..17d7ed5f3ae6 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -247,6 +247,37 @@ do { \ cond_resched(); \ } while (0) +/** + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states + * @old_ts: jiffies at start of processing. + * + * This helper is for long-running softirq handlers, such as NAPI threads in + * networking. The caller should initialize the variable passed in as @old_ts + * at the beginning of the softirq handler. When invoked frequently, this macro + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will + * provide both RCU and RCU-Tasks quiescent states. Note that this macro + * modifies its old_ts argument. + * + * Because regions of code that have disabled softirq act as RCU read-side + * critical sections, this macro should be invoked with softirq (and + * preemption) enabled. + * + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would + * have more chance to invoke schedule() calls and provide necessary quiescent + * states. As a contrast, calling cond_resched() only won't achieve the same + * effect because cond_resched() does not provide RCU-Tasks quiescent states. + */ +#define rcu_softirq_qs_periodic(old_ts) \ +do { \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ + time_after(jiffies, (old_ts) + HZ / 10)) { \ + preempt_disable(); \ + rcu_softirq_qs(); \ + preempt_enable(); \ + (old_ts) = jiffies; \ + } \ +} while (0) + /* * Infrastructure to implement the synchronize_() primitives in * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. 
-- 2.30.2 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai @ 2024-03-19 21:31 ` Paul E. McKenney 2024-03-19 22:00 ` Yan Zhai 2024-03-22 11:24 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 21:31 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote: > When under heavy load, network processing can run CPU-bound for many > tens of seconds. Even in preemptible kernels (non-RT kernel), this can > block RCU Tasks grace periods, which can cause trace-event removal to > take more than a minute, which is unacceptably long. > > This commit therefore creates a new helper function that passes through > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This > hard-coded value suffices for current workloads. > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> If you would like me to take this one via -rcu, I would be happy to take it. If it would be easier for you to push these as a group though networking: Reviewed-by: Paul E. 
McKenney <paulmck@kernel.org> > --- > v4->v5: adjusted kernel docs and commit message > v3->v4: kernel docs error > > --- > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 16f519914415..17d7ed5f3ae6 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -247,6 +247,37 @@ do { \ > cond_resched(); \ > } while (0) > > +/** > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > + * @old_ts: jiffies at start of processing. > + * > + * This helper is for long-running softirq handlers, such as NAPI threads in > + * networking. The caller should initialize the variable passed in as @old_ts > + * at the beginning of the softirq handler. When invoked frequently, this macro > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > + * modifies its old_ts argument. > + * > + * Because regions of code that have disabled softirq act as RCU read-side > + * critical sections, this macro should be invoked with softirq (and > + * preemption) enabled. > + * > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > + * have more chance to invoke schedule() calls and provide necessary quiescent > + * states. As a contrast, calling cond_resched() only won't achieve the same > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > + */ > +#define rcu_softirq_qs_periodic(old_ts) \ > +do { \ > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > + preempt_disable(); \ > + rcu_softirq_qs(); \ > + preempt_enable(); \ > + (old_ts) = jiffies; \ > + } \ > +} while (0) > + > /* > * Infrastructure to implement the synchronize_() primitives in > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. 
> -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 21:31 ` Paul E. McKenney @ 2024-03-19 22:00 ` Yan Zhai 2024-03-19 22:08 ` Paul E. McKenney 0 siblings, 1 reply; 17+ messages in thread From: Yan Zhai @ 2024-03-19 22:00 UTC (permalink / raw) To: paulmck Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior Hi Paul, On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote: > > When under heavy load, network processing can run CPU-bound for many > > tens of seconds. Even in preemptible kernels (non-RT kernel), this can > > block RCU Tasks grace periods, which can cause trace-event removal to > > take more than a minute, which is unacceptably long. > > > > This commit therefore creates a new helper function that passes through > > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This > > hard-coded value suffices for current workloads. > > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > If you would like me to take this one via -rcu, I would be happy to take > it. If it would be easier for you to push these as a group though > networking: > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > Since the whole series aims at fixing net problems, going through net is probably more consistent. Also, thank you for your help through the series! 
Yan > > --- > > v4->v5: adjusted kernel docs and commit message > > v3->v4: kernel docs error > > > > --- > > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > > 1 file changed, 31 insertions(+) > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > index 16f519914415..17d7ed5f3ae6 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -247,6 +247,37 @@ do { \ > > cond_resched(); \ > > } while (0) > > > > +/** > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > + * @old_ts: jiffies at start of processing. > > + * > > + * This helper is for long-running softirq handlers, such as NAPI threads in > > + * networking. The caller should initialize the variable passed in as @old_ts > > + * at the beginning of the softirq handler. When invoked frequently, this macro > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > > + * modifies its old_ts argument. > > + * > > + * Because regions of code that have disabled softirq act as RCU read-side > > + * critical sections, this macro should be invoked with softirq (and > > + * preemption) enabled. > > + * > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. 
> > + */ > > +#define rcu_softirq_qs_periodic(old_ts) \ > > +do { \ > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > + preempt_disable(); \ > > + rcu_softirq_qs(); \ > > + preempt_enable(); \ > > + (old_ts) = jiffies; \ > > + } \ > > +} while (0) > > + > > /* > > * Infrastructure to implement the synchronize_() primitives in > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > -- > > 2.30.2 > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 22:00 ` Yan Zhai @ 2024-03-19 22:08 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 22:08 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 05:00:24PM -0500, Yan Zhai wrote: > Hi Paul, > > On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote: > > > When under heavy load, network processing can run CPU-bound for many > > > tens of seconds. Even in preemptible kernels (non-RT kernel), this can > > > block RCU Tasks grace periods, which can cause trace-event removal to > > > take more than a minute, which is unacceptably long. > > > > > > This commit therefore creates a new helper function that passes through > > > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This > > > hard-coded value suffices for current workloads. > > > > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > > > If you would like me to take this one via -rcu, I would be happy to take > > it. If it would be easier for you to push these as a group though > > networking: > > > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > > Since the whole series aims at fixing net problems, going through net > is probably more consistent. Very good! I will let you push it along. > Also, thank you for your help through the series! No, thank you! 
I had just been asked to find this slowdown when you posted the patch. So it worked out extremely well for me! ;-) Thanx, Paul > Yan > > > > --- > > > v4->v5: adjusted kernel docs and commit message > > > v3->v4: kernel docs error > > > > > > --- > > > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > > > 1 file changed, 31 insertions(+) > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index 16f519914415..17d7ed5f3ae6 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -247,6 +247,37 @@ do { \ > > > cond_resched(); \ > > > } while (0) > > > > > > +/** > > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > > + * @old_ts: jiffies at start of processing. > > > + * > > > + * This helper is for long-running softirq handlers, such as NAPI threads in > > > + * networking. The caller should initialize the variable passed in as @old_ts > > > + * at the beginning of the softirq handler. When invoked frequently, this macro > > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > > > + * modifies its old_ts argument. > > > + * > > > + * Because regions of code that have disabled softirq act as RCU read-side > > > + * critical sections, this macro should be invoked with softirq (and > > > + * preemption) enabled. > > > + * > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. 
> > > + */ > > > +#define rcu_softirq_qs_periodic(old_ts) \ > > > +do { \ > > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > > + preempt_disable(); \ > > > + rcu_softirq_qs(); \ > > > + preempt_enable(); \ > > > + (old_ts) = jiffies; \ > > > + } \ > > > +} while (0) > > > + > > > /* > > > * Infrastructure to implement the synchronize_() primitives in > > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > > -- > > > 2.30.2 > > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai 2024-03-19 21:31 ` Paul E. McKenney @ 2024-03-22 11:24 ` Sebastian Andrzej Siewior 2024-03-22 21:30 ` Paul E. McKenney 1 sibling, 1 reply; 17+ messages in thread From: Sebastian Andrzej Siewior @ 2024-03-22 11:24 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > index 16f519914415..17d7ed5f3ae6 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -247,6 +247,37 @@ do { \ > cond_resched(); \ > } while (0) > > +/** > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > + * @old_ts: jiffies at start of processing. > + * > + * This helper is for long-running softirq handlers, such as NAPI threads in > + * networking. The caller should initialize the variable passed in as @old_ts > + * at the beginning of the softirq handler. When invoked frequently, this macro > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > + * modifies its old_ts argument. > + * > + * Because regions of code that have disabled softirq act as RCU read-side > + * critical sections, this macro should be invoked with softirq (and > + * preemption) enabled. > + * > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > + * have more chance to invoke schedule() calls and provide necessary quiescent > + * states. 
As a contrast, calling cond_resched() only won't achieve the same > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > + */ Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. Why does RT have more scheduling points? The RCU-Tasks thread is starving and yet there is no wake-up, correct? Shouldn't cond_resched() take care of RCU-Tasks's needs, too? This function is used by napi_threaded_poll() which is not invoked in softirq it is a simple thread which does disable BH but this is it. Any pending softirqs are served before the cond_resched(). This napi_threaded_poll() case _basically_ a busy thread doing a lot of work and delaying RCU-Tasks as far as I understand. The same may happen to other busy-worker which have cond_resched() between works, such as the kworker. Therefore I would expect to have some kind of timeout at which point NEED_RESCHED is set so that cond_resched() can do its work. > +#define rcu_softirq_qs_periodic(old_ts) \ > +do { \ > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > + preempt_disable(); \ > + rcu_softirq_qs(); \ > + preempt_enable(); \ > + (old_ts) = jiffies; \ > + } \ > +} while (0) > + > /* > * Infrastructure to implement the synchronize_() primitives in > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. Sebastian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-22 11:24 ` Sebastian Andrzej Siewior @ 2024-03-22 21:30 ` Paul E. McKenney 2024-03-23 2:02 ` Yan Zhai 0 siblings, 1 reply; 17+ messages in thread From: Paul E. McKenney @ 2024-03-22 21:30 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Yan Zhai, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > index 16f519914415..17d7ed5f3ae6 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -247,6 +247,37 @@ do { \ > > cond_resched(); \ > > } while (0) > > > > +/** > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > + * @old_ts: jiffies at start of processing. > > + * > > + * This helper is for long-running softirq handlers, such as NAPI threads in > > + * networking. The caller should initialize the variable passed in as @old_ts > > + * at the beginning of the softirq handler. When invoked frequently, this macro > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > > + * modifies its old_ts argument. > > + * > > + * Because regions of code that have disabled softirq act as RCU read-side > > + * critical sections, this macro should be invoked with softirq (and > > + * preemption) enabled. > > + * > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > + * states. 
As a contrast, calling cond_resched() only won't achieve the same > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > + */ > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > Why does RT have more scheduling points? In RT, isn't BH-disabled code preemptible? But yes, this would not help RCU Tasks. > The RCU-Tasks thread is starving and yet there is no wake-up, correct? > Shouldn't cond_resched() take care of RCU-Tasks's needs, too? > This function is used by napi_threaded_poll() which is not invoked in > softirq it is a simple thread which does disable BH but this is it. Any > pending softirqs are served before the cond_resched(). > > This napi_threaded_poll() case _basically_ a busy thread doing a lot of > work and delaying RCU-Tasks as far as I understand. The same may happen > to other busy-worker which have cond_resched() between works, such as > the kworker. Therefore I would expect to have some kind of timeout at > which point NEED_RESCHED is set so that cond_resched() can do its work. A NEED_RESCHED with a cond_resched() would still be counted as a preemption. If we were intending to keep cond_resched(), I would be thinking in terms of changing that, but only for Tasks RCU. Given no cond_resched(), I would be thinking in terms of removing the check for CONFIG_PREEMPT_RT. Thoughts? Thanx, Paul > > +#define rcu_softirq_qs_periodic(old_ts) \ > > +do { \ > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > + preempt_disable(); \ > > + rcu_softirq_qs(); \ > > + preempt_enable(); \ > > + (old_ts) = jiffies; \ > > + } \ > > +} while (0) > > + > > /* > > * Infrastructure to implement the synchronize_() primitives in > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > Sebastian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-22 21:30 ` Paul E. McKenney @ 2024-03-23 2:02 ` Yan Zhai 2024-03-23 23:53 ` Paul E. McKenney 2024-04-05 13:49 ` Sebastian Andrzej Siewior 0 siblings, 2 replies; 17+ messages in thread From: Yan Zhai @ 2024-03-23 2:02 UTC (permalink / raw) To: paulmck Cc: Sebastian Andrzej Siewior, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > + */ > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > Why does RT have more scheduling points? > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > RCU Tasks. > By "more chance to invoke schedule()", my thought was that cond_resched becomes no op on RT or PREEMPT kernel. So it will not call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a normal irq exit like timer, when NEED_RESCHED is on, schedule()/__schedule(0) can be called time by time then. __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. But I think this code comment does not take into account frequent preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. 
When returning to these busy kthreads, irqentry_exit_cond_resched is in fact called now, not schedule(). So likely __schedule(PREEMPT) is still called frequently, or even more frequently. So the code comment looks incorrect on the RT argument part. We probably should remove the "IS_ENABLED" condition really. Paul and Sebastian, does this sound reasonable to you? Yan ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-23 2:02 ` Yan Zhai @ 2024-03-23 23:53 ` Paul E. McKenney 2024-04-05 13:49 ` Sebastian Andrzej Siewior 1 sibling, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-23 23:53 UTC (permalink / raw) To: Yan Zhai Cc: Sebastian Andrzej Siewior, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 22, 2024 at 09:02:02PM -0500, Yan Zhai wrote: > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > > + */ > > > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > > Why does RT have more scheduling points? > > > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > > RCU Tasks. > > > By "more chance to invoke schedule()", my thought was that > cond_resched becomes no op on RT or PREEMPT kernel. So it will not > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a > normal irq exit like timer, when NEED_RESCHED is on, > schedule()/__schedule(0) can be called time by time then. > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. 
> > But I think this code comment does not take into account frequent > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. > When returning to these busy kthreads, irqentry_exit_cond_resched is > in fact called now, not schedule(). So likely __schedule(PREEMPT) is > still called frequently, or even more frequently. So the code comment > looks incorrect on the RT argument part. We probably should remove the > "IS_ENABLED" condition really. Paul and Sebastian, does this sound > reasonable to you? Removing the "IS_ENABLED(CONFIG_PREEMPT_RT)" condition makes a great deal of sense to me, but I must defer to Sebastian for any RT implications. Thanx, Paul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-23 2:02 ` Yan Zhai 2024-03-23 23:53 ` Paul E. McKenney @ 2024-04-05 13:49 ` Sebastian Andrzej Siewior 2024-04-05 18:13 ` Paul E. McKenney 1 sibling, 1 reply; 17+ messages in thread From: Sebastian Andrzej Siewior @ 2024-04-05 13:49 UTC (permalink / raw) To: Yan Zhai Cc: paulmck, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Thomas Gleixner On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote: > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > > + */ > > > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > > Why does RT have more scheduling points? > > > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > > RCU Tasks. Yes, it is but why does it matter? This is used in the NAPI thread which fully preemptible and does cond_resched(). This should be enough. > By "more chance to invoke schedule()", my thought was that > cond_resched becomes no op on RT or PREEMPT kernel. So it will not > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a It will nop cond_resched(), correct. 
However once something sends NEED_RESCHED then the receiver of this flag will __schedule(SM_PEREEMPT) as soon as possible. That is either because the scheduler sends an IPI and the CPU will do it in the irq-exit path _or_ the thread does preempt_enable() (which includes local_bh_enable()) and the counter hits zero at which point the same context switch happens. Therefore I don't see a difference between CONFIG_PREEMPT and CONFIG_PREEMPT_RT. > normal irq exit like timer, when NEED_RESCHED is on, > schedule()/__schedule(0) can be called time by time then. This I can not parse. __schedule(0) means the task gives up on its own and goes to sleep. This does not happen for the NAPI-thread loop, kworker loop or any other loop that consumes one work item after the other and relies on cond_resched() in between. > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. Okay and that is why? This means you expect that every thread gives up on its own which may take some time depending on the workload. This should not matter. If I see this right, the only difference is rcu_tasks_classic_qs() and I didn't figure out yet what it does. > But I think this code comment does not take into account frequent > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. > When returning to these busy kthreads, irqentry_exit_cond_resched is > in fact called now, not schedule(). So likely __schedule(PREEMPT) is > still called frequently, or even more frequently. So the code comment > looks incorrect on the RT argument part. We probably should remove the > "IS_ENABLED" condition really. Paul and Sebastian, does this sound > reasonable to you? Can you walk me through it? Why is it so important for a task to give up voluntary? There is something wrong here with how RCU tasks works. We want to get rid of the sprinkled cond_resched(). This looks like a another variant of it that might be required in places with no explanation except it takes too long. 
> Yan Sebastian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-04-05 13:49 ` Sebastian Andrzej Siewior @ 2024-04-05 18:13 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-04-05 18:13 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Yan Zhai, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Thomas Gleixner On Fri, Apr 05, 2024 at 03:49:46PM +0200, Sebastian Andrzej Siewior wrote: > On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote: > > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > > > + */ > > > > > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > > > Why does RT have more scheduling points? > > > > > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > > > RCU Tasks. > Yes, it is but why does it matter? This is used in the NAPI thread which > fully preemptible and does cond_resched(). This should be enough. By the time it gets to RCU, a cond_resched()-induced context switch looks like a preemption. This is fine for normal RCU, but Tasks RCU needs a voluntary context switch for a quiescent state. 
Which makes sense, given that code running in a trampoline that is invoked from preemptible code can itself be preempted. So that additional call to rcu_softirq_qs() is needed. (Which invokes rcu_tasks_qs() which in turn invokes rcu_tasks_classic_qs(). And maybe it is also needed for RT. The argument against would be that RT applications have significant idle time on the one hand or spend lots of time in nohz_full userspace on the other, both of which are quiescent states for RCU Tasks. But you tell me! > > By "more chance to invoke schedule()", my thought was that > > cond_resched becomes no op on RT or PREEMPT kernel. So it will not > > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a > It will nop cond_resched(), correct. However once something sends > NEED_RESCHED then the receiver of this flag will __schedule(SM_PEREEMPT) > as soon as possible. That is either because the scheduler sends an IPI > and the CPU will do it in the irq-exit path _or_ the thread does > preempt_enable() (which includes local_bh_enable()) and the counter hits > zero at which point the same context switch happens. > > Therefore I don't see a difference between CONFIG_PREEMPT and > CONFIG_PREEMPT_RT. True, but again RCU Tasks needs a voluntary context switch and the resulting preemption therefore does not qualify. > > normal irq exit like timer, when NEED_RESCHED is on, > > schedule()/__schedule(0) can be called time by time then. > > This I can not parse. __schedule(0) means the task gives up on its own > and goes to sleep. This does not happen for the NAPI-thread loop, > kworker loop or any other loop that consumes one work item after the > other and relies on cond_resched() in between. > > > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. > Okay and that is why? This means you expect that every thread gives up > on its own which may take some time depending on the workload. This > should not matter. 
> > If I see this right, the only difference is rcu_tasks_classic_qs() and I > didn't figure out yet what it does. It marks the task as having passed through a Tasks RCU quiescent state. Which works because this is called from task level (as opposed to from irq, softirq, or NMI), so it cannot be returning to a trampoline that is protected by Tasks RCU. Later on, the Tasks RCU grace-period kthread will see the marking, and remove this task from the list that is blocking the current Tasks-RCU grace period. > > But I think this code comment does not take into account frequent > > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. > > When returning to these busy kthreads, irqentry_exit_cond_resched is > > in fact called now, not schedule(). So likely __schedule(PREEMPT) is > > still called frequently, or even more frequently. So the code comment > > looks incorrect on the RT argument part. We probably should remove the > > "IS_ENABLED" condition really. Paul and Sebastian, does this sound > > reasonable to you? > > Can you walk me through it? Why is it so important for a task to give up > voluntary? There is something wrong here with how RCU tasks works. > We want to get rid of the sprinkled cond_resched(). This looks like a > another variant of it that might be required in places with no > explanation except it takes too long. Hmmm... I would normally point you at the Requirements.rst [1] document for a walkthrough, but it does not cover all of this. How about the upgrade shown below? I agree that it would be nice if Tasks RCU did not have to depend on voluntary context switches. In theory, one alternative would be to examine each task's stack, looking for return addresses in trampolines. In practice, we recently took a look at this and other alternatives, but none were feasible [2]. If you know of a better way, please do not keep it a secret! Note that kernel live patching has similar needs, and may need similar annotations/innovations. 
Thanx, Paul [1] Documentation/RCU/Design/Requirements/Requirements.rst [2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing ------------------------------------------------------------------------ As written: ------------------------------------------------------------------------ Tasks RCU ~~~~~~~~~ Some forms of tracing use “trampolines” to handle the binary rewriting required to install different types of probes. It would be good to be able to free old trampolines, which sounds like a job for some form of RCU. However, because it is necessary to be able to install a trace anywhere in the code, it is not possible to use read-side markers such as rcu_read_lock() and rcu_read_unlock(). In addition, it does not work to have these markers in the trampoline itself, because there would need to be instructions following rcu_read_unlock(). Although synchronize_rcu() would guarantee that execution reached the rcu_read_unlock(), it would not be able to guarantee that execution had completely left the trampoline. Worse yet, in some situations the trampoline's protection must extend a few instructions *prior* to execution reaching the trampoline. For example, these few instructions might calculate the address of the trampoline, so that entering the trampoline would be pre-ordained a surprisingly long time before execution actually reached the trampoline itself. The solution, in the form of `Tasks RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side critical sections that are delimited by voluntary context switches, that is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In addition, transitions to and from userspace execution also delimit tasks-RCU read-side critical sections. Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to interact with them. Note well that involuntary context switches are *not* Tasks-RCU quiescent states. 
After all, in preemptible kernels, a task executing code in a trampoline might be preempted. In this case, the Tasks-RCU grace period clearly cannot end until that task resumes and its execution leaves that trampoline. This means, among other things, that cond_resched() does not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs() from softirq or rcu_tasks_classic_qs() otherwise.) The tasks-RCU API is quite compact, consisting only of call_rcu_tasks(), synchronize_rcu_tasks(), and rcu_barrier_tasks(). In ``CONFIG_PREEMPTION=n`` kernels, trampolines cannot be preempted, so these APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), respectively. In ``CONFIG_PREEMPTION=y`` kernels, trampolines can be preempted, and these three APIs are therefore implemented by separate functions that check for voluntary context switches. ------------------------------------------------------------------------ As patch: ------------------------------------------------------------------------ diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index cccafdaa1f849..f511476b45506 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -2357,6 +2357,7 @@ section. #. `Sched Flavor (Historical)`_ #. `Sleepable RCU`_ #. `Tasks RCU`_ +#. `Tasks Trace RCU`_ Bottom-Half Flavor (Historical) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -2610,6 +2611,16 @@ critical sections that are delimited by voluntary context switches, that is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In addition, transitions to and from userspace execution also delimit tasks-RCU read-side critical sections. +Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to +interact with them. + +Note well that involuntary context switches are *not* Tasks-RCU quiescent +states. 
After all, in preemptible kernels, a task executing code in a +trampoline might be preempted. In this case, the Tasks-RCU grace period +clearly cannot end until that task resumes and its execution leaves that +trampoline. This means, among other things, that cond_resched() does +not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs() +from softirq or rcu_tasks_classic_qs() otherwise.) The tasks-RCU API is quite compact, consisting only of call_rcu_tasks(), synchronize_rcu_tasks(), and @@ -2632,6 +2643,11 @@ moniker. And this operation is considered to be quite rude by real-time workloads that don't want their ``nohz_full`` CPUs receiving IPIs and by battery-powered systems that don't want their idle CPUs to be awakened. +Once kernel entry/exit and deep-idle functions have been properly tagged +``noinstr``, Tasks RCU can start paying attention to idle tasks (except +those that are idle from RCU's perspective) and then Tasks Rude RCU can +be removed from the kernel. + The tasks-rude-RCU API is also reader-marking-free and thus quite compact, consisting of call_rcu_tasks_rude(), synchronize_rcu_tasks_rude(), and rcu_barrier_tasks_rude(). ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai @ 2024-03-19 20:44 ` Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai ` (2 subsequent siblings) 4 siblings, 1 reply; 17+ messages in thread From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior NAPI threads can keep polling packets under load. Currently it is only calling cond_resched() before repolling, but it is not sufficient to clear out the holdout of RCU tasks, which prevent BPF tracing programs from detaching for long period. 
This can be reproduced easily with following set up: ip netns add test1 ip netns add test2 ip -n test1 link add veth1 type veth peer name veth2 netns test2 ip -n test1 link set veth1 up ip -n test1 link set lo up ip -n test2 link set veth2 up ip -n test2 link set lo up ip -n test1 addr add 192.168.1.2/31 dev veth1 ip -n test1 addr add 1.1.1.1/32 dev lo ip -n test2 addr add 192.168.1.3/31 dev veth2 ip -n test2 addr add 2.2.2.2/31 dev lo ip -n test1 route add default via 192.168.1.3 ip -n test2 route add default via 192.168.1.2 for i in `seq 10 210`; do for j in `seq 10 210`; do ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201 done done ip netns exec test2 ethtool -K veth2 gro on ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded' ip netns exec test1 ethtool -K veth1 tso off Then run an iperf3 client/server and a bpftrace script can trigger it: ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null& ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null& bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}' Report RCU quiescent states periodically will resolve the issue. Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- v2->v3: abstracted the work into a RCU helper v1->v2: moved rcu_softirq_qs out from bh critical section, and only raise it after a second of repolling. Added some brief perf test result. 
v2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ v1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t --- net/core/dev.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/core/dev.c b/net/core/dev.c index 303a6ff46e4e..9a67003e49db 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6743,6 +6743,8 @@ static int napi_threaded_poll(void *data) void *have; while (!napi_thread_wait(napi)) { + unsigned long last_qs = jiffies; + for (;;) { bool repoll = false; @@ -6767,6 +6769,7 @@ static int napi_threaded_poll(void *data) if (!repoll) break; + rcu_softirq_qs_periodic(last_qs); cond_resched(); } } -- 2.30.2 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai @ 2024-03-19 21:32 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 21:32 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 01:44:37PM -0700, Yan Zhai wrote: > NAPI threads can keep polling packets under load. Currently it is only > calling cond_resched() before repolling, but it is not sufficient to > clear out the holdout of RCU tasks, which prevent BPF tracing programs > from detaching for long period. This can be reproduced easily with > following set up: > > ip netns add test1 > ip netns add test2 > > ip -n test1 link add veth1 type veth peer name veth2 netns test2 > > ip -n test1 link set veth1 up > ip -n test1 link set lo up > ip -n test2 link set veth2 up > ip -n test2 link set lo up > > ip -n test1 addr add 192.168.1.2/31 dev veth1 > ip -n test1 addr add 1.1.1.1/32 dev lo > ip -n test2 addr add 192.168.1.3/31 dev veth2 > ip -n test2 addr add 2.2.2.2/31 dev lo > > ip -n test1 route add default via 192.168.1.3 > ip -n test2 route add default via 192.168.1.2 > > for i in `seq 10 210`; do > for j in `seq 10 210`; do > ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201 > done > done > > ip netns exec test2 ethtool -K veth2 gro on > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded' > ip netns exec test1 ethtool -K veth1 tso off > > Then run an iperf3 client/server and a bpftrace script can trigger it: > > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null& > ip 
netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null& > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}' > > Report RCU quiescent states periodically will resolve the issue. > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> > --- > v2->v3: abstracted the work into a RCU helper > v1->v2: moved rcu_softirq_qs out from bh critical section, and only > raise it after a second of repolling. Added some brief perf test result. > > v2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ > v1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t > --- > net/core/dev.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/core/dev.c b/net/core/dev.c > index 303a6ff46e4e..9a67003e49db 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -6743,6 +6743,8 @@ static int napi_threaded_poll(void *data) > void *have; > > while (!napi_thread_wait(napi)) { > + unsigned long last_qs = jiffies; > + > for (;;) { > bool repoll = false; > > @@ -6767,6 +6769,7 @@ static int napi_threaded_poll(void *data) > if (!repoll) > break; > > + rcu_softirq_qs_periodic(last_qs); > cond_resched(); > } > } > -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai @ 2024-03-19 20:44 ` Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-20 10:30 ` [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Jesper Dangaard Brouer 2024-03-21 4:40 ` patchwork-bot+netdevbpf 4 siblings, 1 reply; 17+ messages in thread From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior When there are heavy load, cpumap kernel threads can be busy polling packets from redirect queues and block out RCU tasks from reaching quiescent states. It is insufficient to just call cond_resched() in such context. Periodically raise a consolidated RCU QS before cond_resched fixes the problem. 
Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- kernel/bpf/cpumap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 9ee8da477465..a8e34416e960 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -263,6 +263,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames, static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; + unsigned long last_qs = jiffies; complete(&rcpu->kthread_running); set_current_state(TASK_INTERRUPTIBLE); @@ -288,10 +289,12 @@ static int cpu_map_kthread_run(void *data) if (__ptr_ring_empty(rcpu->queue)) { schedule(); sched = 1; + last_qs = jiffies; } else { __set_current_state(TASK_RUNNING); } } else { + rcu_softirq_qs_periodic(last_qs); sched = cond_resched(); } -- 2.30.2 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai @ 2024-03-19 21:32 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 21:32 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 01:44:40PM -0700, Yan Zhai wrote: > When there are heavy load, cpumap kernel threads can be busy polling > packets from redirect queues and block out RCU tasks from reaching > quiescent states. It is insufficient to just call cond_resched() in such > context. Periodically raise a consolidated RCU QS before cond_resched > fixes the problem. > > Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. 
McKenney <paulmck@kernel.org> > --- > kernel/bpf/cpumap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c > index 9ee8da477465..a8e34416e960 100644 > --- a/kernel/bpf/cpumap.c > +++ b/kernel/bpf/cpumap.c > @@ -263,6 +263,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames, > static int cpu_map_kthread_run(void *data) > { > struct bpf_cpu_map_entry *rcpu = data; > + unsigned long last_qs = jiffies; > > complete(&rcpu->kthread_running); > set_current_state(TASK_INTERRUPTIBLE); > @@ -288,10 +289,12 @@ static int cpu_map_kthread_run(void *data) > if (__ptr_ring_empty(rcpu->queue)) { > schedule(); > sched = 1; > + last_qs = jiffies; > } else { > __set_current_state(TASK_RUNNING); > } > } else { > + rcu_softirq_qs_periodic(last_qs); > sched = cond_resched(); > } > > -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 0/3] Report RCU QS for busy network kthreads 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai ` (2 preceding siblings ...) 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai @ 2024-03-20 10:30 ` Jesper Dangaard Brouer 2024-03-21 4:40 ` patchwork-bot+netdevbpf 4 siblings, 0 replies; 17+ messages in thread From: Jesper Dangaard Brouer @ 2024-03-20 10:30 UTC (permalink / raw) To: Yan Zhai, netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Sebastian Andrzej Siewior On 19/03/2024 21.44, Yan Zhai wrote: > This changeset fixes a common problem for busy networking kthreads. > These threads, e.g. NAPI threads, typically will do: > > * polling a batch of packets > * if there are more work, call cond_resched() to allow scheduling > * continue to poll more packets when rx queue is not empty > > We observed this being a problem in production, since it can block RCU > tasks from making progress under heavy load. Investigation indicates > that just calling cond_resched() is insufficient for RCU tasks to reach > quiescent states. This also has the side effect of frequently clearing > the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result, > schedule() will not be called in these circumstances, despite schedule() > in fact provides required quiescent states. This at least affects NAPI > threads, napi_busy_loop, and also cpumap kthread. > > By reporting RCU QSes in these kthreads periodically before cond_resched, the > blocked RCU waiters can correctly progress. 
Instead of just reporting QS for > RCU tasks, these code share the same concern as noted in the commit > d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe"). > So report a consolidated QS for safety. > > It is worth noting that, although this problem is reproducible in > napi_busy_loop, it only shows up when setting the polling interval to as high > as 2ms, which is far larger than recommended 50us-100us in the documentation. > So napi_busy_loop is left untouched. > > Lastly, this does not affect RT kernels, which does not enter the scheduler > through cond_resched(). Without the mentioned side effect, schedule() will > be called time by time, and clear the RCU task holdouts. > > V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/ > V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/ > V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ > V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t > > changes since v4: > * polished comments and docs for the RCU helper as Paul McKenney suggested > > changes since v3: > * fixed kernel-doc errors > > changes since v2: > * created a helper in rcu header to abstract the behavior > * fixed cpumap kthread in addition > > changes since v1: > * disable preemption first as Paul McKenney suggested > > Yan Zhai (3): > rcu: add a helper to report consolidated flavor QS > net: report RCU QS on threaded NAPI repolling > bpf: report RCU QS in cpumap kthread > > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > kernel/bpf/cpumap.c | 3 +++ > net/core/dev.c | 3 +++ > 3 files changed, 37 insertions(+) > Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 0/3] Report RCU QS for busy network kthreads 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai ` (3 preceding siblings ...) 2024-03-20 10:30 ` [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Jesper Dangaard Brouer @ 2024-03-21 4:40 ` patchwork-bot+netdevbpf 4 siblings, 0 replies; 17+ messages in thread From: patchwork-bot+netdevbpf @ 2024-03-21 4:40 UTC (permalink / raw) To: Yan Zhai Cc: netdev, davem, edumazet, kuba, pabeni, jiri, horms, daniel, lorenzo, lixiaoyan, weiwan, alexanderduyck, linux-kernel, rcu, bpf, kernel-team, joel, paulmck, toke, alexei.starovoitov, rostedt, mark.rutland, hawk, bigeasy Hello: This series was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Tue, 19 Mar 2024 13:44:30 -0700 you wrote: > This changeset fixes a common problem for busy networking kthreads. > These threads, e.g. NAPI threads, typically will do: > > * polling a batch of packets > * if there are more work, call cond_resched() to allow scheduling > * continue to poll more packets when rx queue is not empty > > [...] Here is the summary with links: - [v5,net,1/3] rcu: add a helper to report consolidated flavor QS https://git.kernel.org/netdev/net/c/1a77557d48cf - [v5,net,2/3] net: report RCU QS on threaded NAPI repolling https://git.kernel.org/netdev/net/c/d6dbbb11247c - [v5,net,3/3] bpf: report RCU QS in cpumap kthread https://git.kernel.org/netdev/net/c/00bf63122459 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-04-05 18:13 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai 2024-03-19 21:31 ` Paul E. McKenney 2024-03-19 22:00 ` Yan Zhai 2024-03-19 22:08 ` Paul E. McKenney 2024-03-22 11:24 ` Sebastian Andrzej Siewior 2024-03-22 21:30 ` Paul E. McKenney 2024-03-23 2:02 ` Yan Zhai 2024-03-23 23:53 ` Paul E. McKenney 2024-04-05 13:49 ` Sebastian Andrzej Siewior 2024-04-05 18:13 ` Paul E. McKenney 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-20 10:30 ` [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Jesper Dangaard Brouer 2024-03-21 4:40 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).