* [PATCH v5 net 0/3] Report RCU QS for busy network kthreads
@ 2024-03-19 20:44 Yan Zhai
2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi,
Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf,
kernel-team, Joel Fernandes, Paul E. McKenney,
Toke Høiland-Jørgensen, Alexei Starovoitov,
Steven Rostedt, mark.rutland, Jesper Dangaard Brouer,
Sebastian Andrzej Siewior
This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically will do:
* polling a batch of packets
* if there is more work, call cond_resched() to allow scheduling
* continue to poll more packets when rx queue is not empty
We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that just calling cond_resched() is insufficient for RCU tasks to reach
quiescent states. It also has the side effect of frequently clearing
the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result,
schedule() will not be called in these circumstances, even though
schedule() in fact provides the required quiescent states. This at least
affects NAPI threads, napi_busy_loop, and the cpumap kthread.
By reporting RCU QSes in these kthreads periodically before calling
cond_resched(), the blocked RCU waiters can make progress correctly.
Instead of reporting a QS for RCU tasks only, this code shares the same
concern noted in commit d28139c4e967 ("rcu: Apply RCU-bh QSes to
RCU-sched and RCU-preempt when safe"), so a consolidated QS is reported
for safety.
It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when the polling interval is set as
high as 2ms, which is far larger than the 50us-100us recommended in the
documentation. So napi_busy_loop is left untouched.
Lastly, this does not affect RT kernels, which do not enter the
scheduler through cond_resched(). Without the side effect mentioned
above, schedule() will be called from time to time, clearing the RCU
Tasks holdouts.
V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/
V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/
V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t
changes since v4:
* polished comments and docs for the RCU helper as Paul McKenney suggested
changes since v3:
* fixed kernel-doc errors
changes since v2:
* created a helper in rcu header to abstract the behavior
* fixed cpumap kthread in addition
changes since v1:
* disable preemption first as Paul McKenney suggested
Yan Zhai (3):
rcu: add a helper to report consolidated flavor QS
net: report RCU QS on threaded NAPI repolling
bpf: report RCU QS in cpumap kthread
include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++
kernel/bpf/cpumap.c | 3 +++
net/core/dev.c | 3 +++
3 files changed, 37 insertions(+)
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread* [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai @ 2024-03-19 20:44 ` Yan Zhai 2024-03-19 21:31 ` Paul E. McKenney 2024-03-22 11:24 ` Sebastian Andrzej Siewior 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai ` (3 subsequent siblings) 4 siblings, 2 replies; 17+ messages in thread From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior When under heavy load, network processing can run CPU-bound for many tens of seconds. Even in preemptible kernels (non-RT kernel), this can block RCU Tasks grace periods, which can cause trace-event removal to take more than a minute, which is unacceptably long. This commit therefore creates a new helper function that passes through both RCU and RCU-Tasks quiescent states every 100 milliseconds. This hard-coded value suffices for current workloads. Suggested-by: Paul E. 
McKenney <paulmck@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- v4->v5: adjusted kernel docs and commit message v3->v4: kernel docs error --- include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 16f519914415..17d7ed5f3ae6 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -247,6 +247,37 @@ do { \ cond_resched(); \ } while (0) +/** + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states + * @old_ts: jiffies at start of processing. + * + * This helper is for long-running softirq handlers, such as NAPI threads in + * networking. The caller should initialize the variable passed in as @old_ts + * at the beginning of the softirq handler. When invoked frequently, this macro + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will + * provide both RCU and RCU-Tasks quiescent states. Note that this macro + * modifies its old_ts argument. + * + * Because regions of code that have disabled softirq act as RCU read-side + * critical sections, this macro should be invoked with softirq (and + * preemption) enabled. + * + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would + * have more chance to invoke schedule() calls and provide necessary quiescent + * states. As a contrast, calling cond_resched() only won't achieve the same + * effect because cond_resched() does not provide RCU-Tasks quiescent states. + */ +#define rcu_softirq_qs_periodic(old_ts) \ +do { \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ + time_after(jiffies, (old_ts) + HZ / 10)) { \ + preempt_disable(); \ + rcu_softirq_qs(); \ + preempt_enable(); \ + (old_ts) = jiffies; \ + } \ +} while (0) + /* * Infrastructure to implement the synchronize_() primitives in * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. 
-- 2.30.2 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai @ 2024-03-19 21:31 ` Paul E. McKenney 2024-03-19 22:00 ` Yan Zhai 2024-03-22 11:24 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 21:31 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote: > When under heavy load, network processing can run CPU-bound for many > tens of seconds. Even in preemptible kernels (non-RT kernel), this can > block RCU Tasks grace periods, which can cause trace-event removal to > take more than a minute, which is unacceptably long. > > This commit therefore creates a new helper function that passes through > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This > hard-coded value suffices for current workloads. > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> If you would like me to take this one via -rcu, I would be happy to take it. If it would be easier for you to push these as a group though networking: Reviewed-by: Paul E. 
McKenney <paulmck@kernel.org> > --- > v4->v5: adjusted kernel docs and commit message > v3->v4: kernel docs error > > --- > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 16f519914415..17d7ed5f3ae6 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -247,6 +247,37 @@ do { \ > cond_resched(); \ > } while (0) > > +/** > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > + * @old_ts: jiffies at start of processing. > + * > + * This helper is for long-running softirq handlers, such as NAPI threads in > + * networking. The caller should initialize the variable passed in as @old_ts > + * at the beginning of the softirq handler. When invoked frequently, this macro > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > + * modifies its old_ts argument. > + * > + * Because regions of code that have disabled softirq act as RCU read-side > + * critical sections, this macro should be invoked with softirq (and > + * preemption) enabled. > + * > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > + * have more chance to invoke schedule() calls and provide necessary quiescent > + * states. As a contrast, calling cond_resched() only won't achieve the same > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > + */ > +#define rcu_softirq_qs_periodic(old_ts) \ > +do { \ > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > + preempt_disable(); \ > + rcu_softirq_qs(); \ > + preempt_enable(); \ > + (old_ts) = jiffies; \ > + } \ > +} while (0) > + > /* > * Infrastructure to implement the synchronize_() primitives in > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. 
> -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 21:31 ` Paul E. McKenney @ 2024-03-19 22:00 ` Yan Zhai 2024-03-19 22:08 ` Paul E. McKenney 0 siblings, 1 reply; 17+ messages in thread From: Yan Zhai @ 2024-03-19 22:00 UTC (permalink / raw) To: paulmck Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior Hi Paul, On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote: > > When under heavy load, network processing can run CPU-bound for many > > tens of seconds. Even in preemptible kernels (non-RT kernel), this can > > block RCU Tasks grace periods, which can cause trace-event removal to > > take more than a minute, which is unacceptably long. > > > > This commit therefore creates a new helper function that passes through > > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This > > hard-coded value suffices for current workloads. > > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > If you would like me to take this one via -rcu, I would be happy to take > it. If it would be easier for you to push these as a group though > networking: > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > Since the whole series aims at fixing net problems, going through net is probably more consistent. Also, thank you for your help through the series! 
Yan > > --- > > v4->v5: adjusted kernel docs and commit message > > v3->v4: kernel docs error > > > > --- > > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > > 1 file changed, 31 insertions(+) > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > index 16f519914415..17d7ed5f3ae6 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -247,6 +247,37 @@ do { \ > > cond_resched(); \ > > } while (0) > > > > +/** > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > + * @old_ts: jiffies at start of processing. > > + * > > + * This helper is for long-running softirq handlers, such as NAPI threads in > > + * networking. The caller should initialize the variable passed in as @old_ts > > + * at the beginning of the softirq handler. When invoked frequently, this macro > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > > + * modifies its old_ts argument. > > + * > > + * Because regions of code that have disabled softirq act as RCU read-side > > + * critical sections, this macro should be invoked with softirq (and > > + * preemption) enabled. > > + * > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. 
> > + */ > > +#define rcu_softirq_qs_periodic(old_ts) \ > > +do { \ > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > + preempt_disable(); \ > > + rcu_softirq_qs(); \ > > + preempt_enable(); \ > > + (old_ts) = jiffies; \ > > + } \ > > +} while (0) > > + > > /* > > * Infrastructure to implement the synchronize_() primitives in > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > -- > > 2.30.2 > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 22:00 ` Yan Zhai @ 2024-03-19 22:08 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 22:08 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 05:00:24PM -0500, Yan Zhai wrote: > Hi Paul, > > On Tue, Mar 19, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Tue, Mar 19, 2024 at 01:44:34PM -0700, Yan Zhai wrote: > > > When under heavy load, network processing can run CPU-bound for many > > > tens of seconds. Even in preemptible kernels (non-RT kernel), this can > > > block RCU Tasks grace periods, which can cause trace-event removal to > > > take more than a minute, which is unacceptably long. > > > > > > This commit therefore creates a new helper function that passes through > > > both RCU and RCU-Tasks quiescent states every 100 milliseconds. This > > > hard-coded value suffices for current workloads. > > > > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > > > If you would like me to take this one via -rcu, I would be happy to take > > it. If it would be easier for you to push these as a group though > > networking: > > > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > > Since the whole series aims at fixing net problems, going through net > is probably more consistent. Very good! I will let you push it along. > Also, thank you for your help through the series! No, thank you! 
I had just been asked to find this slowdown when you posted the patch. So it worked out extremely well for me! ;-) Thanx, Paul > Yan > > > > --- > > > v4->v5: adjusted kernel docs and commit message > > > v3->v4: kernel docs error > > > > > > --- > > > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > > > 1 file changed, 31 insertions(+) > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index 16f519914415..17d7ed5f3ae6 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -247,6 +247,37 @@ do { \ > > > cond_resched(); \ > > > } while (0) > > > > > > +/** > > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > > + * @old_ts: jiffies at start of processing. > > > + * > > > + * This helper is for long-running softirq handlers, such as NAPI threads in > > > + * networking. The caller should initialize the variable passed in as @old_ts > > > + * at the beginning of the softirq handler. When invoked frequently, this macro > > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > > > + * modifies its old_ts argument. > > > + * > > > + * Because regions of code that have disabled softirq act as RCU read-side > > > + * critical sections, this macro should be invoked with softirq (and > > > + * preemption) enabled. > > > + * > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. 
> > > + */ > > > +#define rcu_softirq_qs_periodic(old_ts) \ > > > +do { \ > > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > > + preempt_disable(); \ > > > + rcu_softirq_qs(); \ > > > + preempt_enable(); \ > > > + (old_ts) = jiffies; \ > > > + } \ > > > +} while (0) > > > + > > > /* > > > * Infrastructure to implement the synchronize_() primitives in > > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > > -- > > > 2.30.2 > > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai 2024-03-19 21:31 ` Paul E. McKenney @ 2024-03-22 11:24 ` Sebastian Andrzej Siewior 2024-03-22 21:30 ` Paul E. McKenney 1 sibling, 1 reply; 17+ messages in thread From: Sebastian Andrzej Siewior @ 2024-03-22 11:24 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > index 16f519914415..17d7ed5f3ae6 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -247,6 +247,37 @@ do { \ > cond_resched(); \ > } while (0) > > +/** > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > + * @old_ts: jiffies at start of processing. > + * > + * This helper is for long-running softirq handlers, such as NAPI threads in > + * networking. The caller should initialize the variable passed in as @old_ts > + * at the beginning of the softirq handler. When invoked frequently, this macro > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > + * modifies its old_ts argument. > + * > + * Because regions of code that have disabled softirq act as RCU read-side > + * critical sections, this macro should be invoked with softirq (and > + * preemption) enabled. > + * > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > + * have more chance to invoke schedule() calls and provide necessary quiescent > + * states. 
As a contrast, calling cond_resched() only won't achieve the same > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > + */ Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. Why does RT have more scheduling points? The RCU-Tasks thread is starving and yet there is no wake-up, correct? Shouldn't cond_resched() take care of RCU-Tasks's needs, too? This function is used by napi_threaded_poll() which is not invoked in softirq it is a simple thread which does disable BH but this is it. Any pending softirqs are served before the cond_resched(). This napi_threaded_poll() case _basically_ a busy thread doing a lot of work and delaying RCU-Tasks as far as I understand. The same may happen to other busy-worker which have cond_resched() between works, such as the kworker. Therefore I would expect to have some kind of timeout at which point NEED_RESCHED is set so that cond_resched() can do its work. > +#define rcu_softirq_qs_periodic(old_ts) \ > +do { \ > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > + preempt_disable(); \ > + rcu_softirq_qs(); \ > + preempt_enable(); \ > + (old_ts) = jiffies; \ > + } \ > +} while (0) > + > /* > * Infrastructure to implement the synchronize_() primitives in > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. Sebastian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-22 11:24 ` Sebastian Andrzej Siewior @ 2024-03-22 21:30 ` Paul E. McKenney 2024-03-23 2:02 ` Yan Zhai 0 siblings, 1 reply; 17+ messages in thread From: Paul E. McKenney @ 2024-03-22 21:30 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Yan Zhai, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > index 16f519914415..17d7ed5f3ae6 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -247,6 +247,37 @@ do { \ > > cond_resched(); \ > > } while (0) > > > > +/** > > + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > + * @old_ts: jiffies at start of processing. > > + * > > + * This helper is for long-running softirq handlers, such as NAPI threads in > > + * networking. The caller should initialize the variable passed in as @old_ts > > + * at the beginning of the softirq handler. When invoked frequently, this macro > > + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will > > + * provide both RCU and RCU-Tasks quiescent states. Note that this macro > > + * modifies its old_ts argument. > > + * > > + * Because regions of code that have disabled softirq act as RCU read-side > > + * critical sections, this macro should be invoked with softirq (and > > + * preemption) enabled. > > + * > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > + * states. 
As a contrast, calling cond_resched() only won't achieve the same > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > + */ > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > Why does RT have more scheduling points? In RT, isn't BH-disabled code preemptible? But yes, this would not help RCU Tasks. > The RCU-Tasks thread is starving and yet there is no wake-up, correct? > Shouldn't cond_resched() take care of RCU-Tasks's needs, too? > This function is used by napi_threaded_poll() which is not invoked in > softirq it is a simple thread which does disable BH but this is it. Any > pending softirqs are served before the cond_resched(). > > This napi_threaded_poll() case _basically_ a busy thread doing a lot of > work and delaying RCU-Tasks as far as I understand. The same may happen > to other busy-worker which have cond_resched() between works, such as > the kworker. Therefore I would expect to have some kind of timeout at > which point NEED_RESCHED is set so that cond_resched() can do its work. A NEED_RESCHED with a cond_resched() would still be counted as a preemption. If we were intending to keep cond_resched(), I would be thinking in terms of changing that, but only for Tasks RCU. Given no cond_resched(), I would be thinking in terms of removing the check for CONFIG_PREEMPT_RT. Thoughts? Thanx, Paul > > +#define rcu_softirq_qs_periodic(old_ts) \ > > +do { \ > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > + preempt_disable(); \ > > + rcu_softirq_qs(); \ > > + preempt_enable(); \ > > + (old_ts) = jiffies; \ > > + } \ > > +} while (0) > > + > > /* > > * Infrastructure to implement the synchronize_() primitives in > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > Sebastian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-22 21:30 ` Paul E. McKenney @ 2024-03-23 2:02 ` Yan Zhai 2024-03-23 23:53 ` Paul E. McKenney 2024-04-05 13:49 ` Sebastian Andrzej Siewior 0 siblings, 2 replies; 17+ messages in thread From: Yan Zhai @ 2024-03-23 2:02 UTC (permalink / raw) To: paulmck Cc: Sebastian Andrzej Siewior, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > + */ > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > Why does RT have more scheduling points? > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > RCU Tasks. > By "more chance to invoke schedule()", my thought was that cond_resched becomes no op on RT or PREEMPT kernel. So it will not call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a normal irq exit like timer, when NEED_RESCHED is on, schedule()/__schedule(0) can be called time by time then. __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. But I think this code comment does not take into account frequent preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. 
When returning to these busy kthreads, irqentry_exit_cond_resched is in fact called now, not schedule(). So likely __schedule(PREEMPT) is still called frequently, or even more frequently. So the code comment looks incorrect on the RT argument part. We probably should remove the "IS_ENABLED" condition really. Paul and Sebastian, does this sound reasonable to you? Yan ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-23 2:02 ` Yan Zhai @ 2024-03-23 23:53 ` Paul E. McKenney 2024-04-05 13:49 ` Sebastian Andrzej Siewior 1 sibling, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-23 23:53 UTC (permalink / raw) To: Yan Zhai Cc: Sebastian Andrzej Siewior, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 22, 2024 at 09:02:02PM -0500, Yan Zhai wrote: > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > > + */ > > > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > > Why does RT have more scheduling points? > > > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > > RCU Tasks. > > > By "more chance to invoke schedule()", my thought was that > cond_resched becomes no op on RT or PREEMPT kernel. So it will not > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a > normal irq exit like timer, when NEED_RESCHED is on, > schedule()/__schedule(0) can be called time by time then. > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. 
> > But I think this code comment does not take into account frequent > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. > When returning to these busy kthreads, irqentry_exit_cond_resched is > in fact called now, not schedule(). So likely __schedule(PREEMPT) is > still called frequently, or even more frequently. So the code comment > looks incorrect on the RT argument part. We probably should remove the > "IS_ENABLED" condition really. Paul and Sebastian, does this sound > reasonable to you? Removing the "IS_ENABLED(CONFIG_PREEMPT_RT)" condition makes a great deal of sense to me, but I must defer to Sebastian for any RT implications. Thanx, Paul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-23 2:02 ` Yan Zhai 2024-03-23 23:53 ` Paul E. McKenney @ 2024-04-05 13:49 ` Sebastian Andrzej Siewior 2024-04-05 18:13 ` Paul E. McKenney 1 sibling, 1 reply; 17+ messages in thread From: Sebastian Andrzej Siewior @ 2024-04-05 13:49 UTC (permalink / raw) To: Yan Zhai Cc: paulmck, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Thomas Gleixner On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote: > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > > + */ > > > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > > Why does RT have more scheduling points? > > > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > > RCU Tasks. Yes, it is but why does it matter? This is used in the NAPI thread which fully preemptible and does cond_resched(). This should be enough. > By "more chance to invoke schedule()", my thought was that > cond_resched becomes no op on RT or PREEMPT kernel. So it will not > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a It will nop cond_resched(), correct. 
However once something sends NEED_RESCHED then the receiver of this flag will __schedule(SM_PEREEMPT) as soon as possible. That is either because the scheduler sends an IPI and the CPU will do it in the irq-exit path _or_ the thread does preempt_enable() (which includes local_bh_enable()) and the counter hits zero at which point the same context switch happens. Therefore I don't see a difference between CONFIG_PREEMPT and CONFIG_PREEMPT_RT. > normal irq exit like timer, when NEED_RESCHED is on, > schedule()/__schedule(0) can be called time by time then. This I can not parse. __schedule(0) means the task gives up on its own and goes to sleep. This does not happen for the NAPI-thread loop, kworker loop or any other loop that consumes one work item after the other and relies on cond_resched() in between. > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. Okay and that is why? This means you expect that every thread gives up on its own which may take some time depending on the workload. This should not matter. If I see this right, the only difference is rcu_tasks_classic_qs() and I didn't figure out yet what it does. > But I think this code comment does not take into account frequent > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. > When returning to these busy kthreads, irqentry_exit_cond_resched is > in fact called now, not schedule(). So likely __schedule(PREEMPT) is > still called frequently, or even more frequently. So the code comment > looks incorrect on the RT argument part. We probably should remove the > "IS_ENABLED" condition really. Paul and Sebastian, does this sound > reasonable to you? Can you walk me through it? Why is it so important for a task to give up voluntary? There is something wrong here with how RCU tasks works. We want to get rid of the sprinkled cond_resched(). This looks like a another variant of it that might be required in places with no explanation except it takes too long. 
> Yan Sebastian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-04-05 13:49 ` Sebastian Andrzej Siewior @ 2024-04-05 18:13 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-04-05 18:13 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Yan Zhai, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Thomas Gleixner On Fri, Apr 05, 2024 at 03:49:46PM +0200, Sebastian Andrzej Siewior wrote: > On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote: > > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote: > > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote: > > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would > > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent > > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same > > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states. > > > > > + */ > > > > > > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not. > > > > Why does RT have more scheduling points? > > > > > > In RT, isn't BH-disabled code preemptible? But yes, this would not help > > > RCU Tasks. > Yes, it is but why does it matter? This is used in the NAPI thread which > fully preemptible and does cond_resched(). This should be enough. By the time it gets to RCU, a cond_resched()-induced context switch looks like a preemption. This is fine for normal RCU, but Tasks RCU needs a voluntary context switch for a quiescent state. 
Which makes sense, given that code running in a trampoline that is invoked from preemptible code can itself be preempted. So that additional call to rcu_softirq_qs() is needed. (Which invokes rcu_tasks_qs() which in turn invokes rcu_tasks_classic_qs(). And maybe it is also needed for RT. The argument against would be that RT applications have significant idle time on the one hand or spend lots of time in nohz_full userspace on the other, both of which are quiescent states for RCU Tasks. But you tell me! > > By "more chance to invoke schedule()", my thought was that > > cond_resched becomes no op on RT or PREEMPT kernel. So it will not > > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a > It will nop cond_resched(), correct. However once something sends > NEED_RESCHED then the receiver of this flag will __schedule(SM_PEREEMPT) > as soon as possible. That is either because the scheduler sends an IPI > and the CPU will do it in the irq-exit path _or_ the thread does > preempt_enable() (which includes local_bh_enable()) and the counter hits > zero at which point the same context switch happens. > > Therefore I don't see a difference between CONFIG_PREEMPT and > CONFIG_PREEMPT_RT. True, but again RCU Tasks needs a voluntary context switch and the resulting preemption therefore does not qualify. > > normal irq exit like timer, when NEED_RESCHED is on, > > schedule()/__schedule(0) can be called time by time then. > > This I can not parse. __schedule(0) means the task gives up on its own > and goes to sleep. This does not happen for the NAPI-thread loop, > kworker loop or any other loop that consumes one work item after the > other and relies on cond_resched() in between. > > > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not. > Okay and that is why? This means you expect that every thread gives up > on its own which may take some time depending on the workload. This > should not matter. 
> > If I see this right, the only difference is rcu_tasks_classic_qs() and I > didn't figure out yet what it does. It marks the task as having passed through a Tasks RCU quiescent state. Which works because this is called from task level (as opposed to from irq, softirq, or NMI), so it cannot be returning to a trampoline that is protected by Tasks RCU. Later on, the Tasks RCU grace-period kthread will see the marking, and remove this task from the list that is blocking the current Tasks-RCU grace period. > > But I think this code comment does not take into account frequent > > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel. > > When returning to these busy kthreads, irqentry_exit_cond_resched is > > in fact called now, not schedule(). So likely __schedule(PREEMPT) is > > still called frequently, or even more frequently. So the code comment > > looks incorrect on the RT argument part. We probably should remove the > > "IS_ENABLED" condition really. Paul and Sebastian, does this sound > > reasonable to you? > > Can you walk me through it? Why is it so important for a task to give up > voluntary? There is something wrong here with how RCU tasks works. > We want to get rid of the sprinkled cond_resched(). This looks like a > another variant of it that might be required in places with no > explanation except it takes too long. Hmmm... I would normally point you at the Requirements.rst [1] document for a walkthrough, but it does not cover all of this. How about the upgrade shown below? I agree that it would be nice if Tasks RCU did not have to depend on voluntary context switches. In theory, one alternative would be to examine each task's stack, looking for return addresses in trampolines. In practice, we recently took a look at this and other alternatives, but none were feasible [2]. If you know of a better way, please do not keep it a secret! Note that kernel live patching has similar needs, and may need similar annotations/innovations. 
Thanx, Paul [1] Documentation/RCU/Design/Requirements/Requirements.rst [2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing ------------------------------------------------------------------------ As written: ------------------------------------------------------------------------ Tasks RCU ~~~~~~~~~ Some forms of tracing use “trampolines” to handle the binary rewriting required to install different types of probes. It would be good to be able to free old trampolines, which sounds like a job for some form of RCU. However, because it is necessary to be able to install a trace anywhere in the code, it is not possible to use read-side markers such as rcu_read_lock() and rcu_read_unlock(). In addition, it does not work to have these markers in the trampoline itself, because there would need to be instructions following rcu_read_unlock(). Although synchronize_rcu() would guarantee that execution reached the rcu_read_unlock(), it would not be able to guarantee that execution had completely left the trampoline. Worse yet, in some situations the trampoline's protection must extend a few instructions *prior* to execution reaching the trampoline. For example, these few instructions might calculate the address of the trampoline, so that entering the trampoline would be pre-ordained a surprisingly long time before execution actually reached the trampoline itself. The solution, in the form of `Tasks RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side critical sections that are delimited by voluntary context switches, that is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In addition, transitions to and from userspace execution also delimit tasks-RCU read-side critical sections. Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to interact with them. Note well that involuntary context switches are *not* Tasks-RCU quiescent states. 
After all, in preemptible kernels, a task executing code in a trampoline might be preempted. In this case, the Tasks-RCU grace period clearly cannot end until that task resumes and its execution leaves that trampoline. This means, among other things, that cond_resched() does not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs() from softirq or rcu_tasks_classic_qs() otherwise.) The tasks-RCU API is quite compact, consisting only of call_rcu_tasks(), synchronize_rcu_tasks(), and rcu_barrier_tasks(). In ``CONFIG_PREEMPTION=n`` kernels, trampolines cannot be preempted, so these APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), respectively. In ``CONFIG_PREEMPTION=y`` kernels, trampolines can be preempted, and these three APIs are therefore implemented by separate functions that check for voluntary context switches. ------------------------------------------------------------------------ As patch: ------------------------------------------------------------------------ diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index cccafdaa1f849..f511476b45506 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -2357,6 +2357,7 @@ section. #. `Sched Flavor (Historical)`_ #. `Sleepable RCU`_ #. `Tasks RCU`_ +#. `Tasks Trace RCU`_ Bottom-Half Flavor (Historical) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -2610,6 +2611,16 @@ critical sections that are delimited by voluntary context switches, that is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In addition, transitions to and from userspace execution also delimit tasks-RCU read-side critical sections. +Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to +interact with them. + +Note well that involuntary context switches are *not* Tasks-RCU quiescent +states. 
After all, in preemptible kernels, a task executing code in a +trampoline might be preempted. In this case, the Tasks-RCU grace period +clearly cannot end until that task resumes and its execution leaves that +trampoline. This means, among other things, that cond_resched() does +not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs() +from softirq or rcu_tasks_classic_qs() otherwise.) The tasks-RCU API is quite compact, consisting only of call_rcu_tasks(), synchronize_rcu_tasks(), and @@ -2632,6 +2643,11 @@ moniker. And this operation is considered to be quite rude by real-time workloads that don't want their ``nohz_full`` CPUs receiving IPIs and by battery-powered systems that don't want their idle CPUs to be awakened. +Once kernel entry/exit and deep-idle functions have been properly tagged +``noinstr``, Tasks RCU can start paying attention to idle tasks (except +those that are idle from RCU's perspective) and then Tasks Rude RCU can +be removed from the kernel. + The tasks-rude-RCU API is also reader-marking-free and thus quite compact, consisting of call_rcu_tasks_rude(), synchronize_rcu_tasks_rude(), and rcu_barrier_tasks_rude(). ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai @ 2024-03-19 20:44 ` Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai ` (2 subsequent siblings) 4 siblings, 1 reply; 17+ messages in thread From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior NAPI threads can keep polling packets under load. Currently it is only calling cond_resched() before repolling, but it is not sufficient to clear out the holdout of RCU tasks, which prevent BPF tracing programs from detaching for long period. 
This can be reproduced easily with following set up: ip netns add test1 ip netns add test2 ip -n test1 link add veth1 type veth peer name veth2 netns test2 ip -n test1 link set veth1 up ip -n test1 link set lo up ip -n test2 link set veth2 up ip -n test2 link set lo up ip -n test1 addr add 192.168.1.2/31 dev veth1 ip -n test1 addr add 1.1.1.1/32 dev lo ip -n test2 addr add 192.168.1.3/31 dev veth2 ip -n test2 addr add 2.2.2.2/31 dev lo ip -n test1 route add default via 192.168.1.3 ip -n test2 route add default via 192.168.1.2 for i in `seq 10 210`; do for j in `seq 10 210`; do ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201 done done ip netns exec test2 ethtool -K veth2 gro on ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded' ip netns exec test1 ethtool -K veth1 tso off Then run an iperf3 client/server and a bpftrace script can trigger it: ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null& ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null& bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}' Report RCU quiescent states periodically will resolve the issue. Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- v2->v3: abstracted the work into a RCU helper v1->v2: moved rcu_softirq_qs out from bh critical section, and only raise it after a second of repolling. Added some brief perf test result. 
v2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ v1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t --- net/core/dev.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/core/dev.c b/net/core/dev.c index 303a6ff46e4e..9a67003e49db 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6743,6 +6743,8 @@ static int napi_threaded_poll(void *data) void *have; while (!napi_thread_wait(napi)) { + unsigned long last_qs = jiffies; + for (;;) { bool repoll = false; @@ -6767,6 +6769,7 @@ static int napi_threaded_poll(void *data) if (!repoll) break; + rcu_softirq_qs_periodic(last_qs); cond_resched(); } } -- 2.30.2 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai @ 2024-03-19 21:32 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 21:32 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 01:44:37PM -0700, Yan Zhai wrote: > NAPI threads can keep polling packets under load. Currently it is only > calling cond_resched() before repolling, but it is not sufficient to > clear out the holdout of RCU tasks, which prevent BPF tracing programs > from detaching for long period. This can be reproduced easily with > following set up: > > ip netns add test1 > ip netns add test2 > > ip -n test1 link add veth1 type veth peer name veth2 netns test2 > > ip -n test1 link set veth1 up > ip -n test1 link set lo up > ip -n test2 link set veth2 up > ip -n test2 link set lo up > > ip -n test1 addr add 192.168.1.2/31 dev veth1 > ip -n test1 addr add 1.1.1.1/32 dev lo > ip -n test2 addr add 192.168.1.3/31 dev veth2 > ip -n test2 addr add 2.2.2.2/31 dev lo > > ip -n test1 route add default via 192.168.1.3 > ip -n test2 route add default via 192.168.1.2 > > for i in `seq 10 210`; do > for j in `seq 10 210`; do > ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201 > done > done > > ip netns exec test2 ethtool -K veth2 gro on > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded' > ip netns exec test1 ethtool -K veth1 tso off > > Then run an iperf3 client/server and a bpftrace script can trigger it: > > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null& > ip 
netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null& > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}' > > Report RCU quiescent states periodically will resolve the issue. > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> > --- > v2->v3: abstracted the work into a RCU helper > v1->v2: moved rcu_softirq_qs out from bh critical section, and only > raise it after a second of repolling. Added some brief perf test result. > > v2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ > v1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t > --- > net/core/dev.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/core/dev.c b/net/core/dev.c > index 303a6ff46e4e..9a67003e49db 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -6743,6 +6743,8 @@ static int napi_threaded_poll(void *data) > void *have; > > while (!napi_thread_wait(napi)) { > + unsigned long last_qs = jiffies; > + > for (;;) { > bool repoll = false; > > @@ -6767,6 +6769,7 @@ static int napi_threaded_poll(void *data) > if (!repoll) > break; > > + rcu_softirq_qs_periodic(last_qs); > cond_resched(); > } > } > -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai @ 2024-03-19 20:44 ` Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-20 10:30 ` [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Jesper Dangaard Brouer 2024-03-21 4:40 ` patchwork-bot+netdevbpf 4 siblings, 1 reply; 17+ messages in thread From: Yan Zhai @ 2024-03-19 20:44 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior When there are heavy load, cpumap kernel threads can be busy polling packets from redirect queues and block out RCU tasks from reaching quiescent states. It is insufficient to just call cond_resched() in such context. Periodically raise a consolidated RCU QS before cond_resched fixes the problem. 
Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- kernel/bpf/cpumap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 9ee8da477465..a8e34416e960 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -263,6 +263,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames, static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; + unsigned long last_qs = jiffies; complete(&rcpu->kthread_running); set_current_state(TASK_INTERRUPTIBLE); @@ -288,10 +289,12 @@ static int cpu_map_kthread_run(void *data) if (__ptr_ring_empty(rcpu->queue)) { schedule(); sched = 1; + last_qs = jiffies; } else { __set_current_state(TASK_RUNNING); } } else { + rcu_softirq_qs_periodic(last_qs); sched = cond_resched(); } -- 2.30.2 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai @ 2024-03-19 21:32 ` Paul E. McKenney 0 siblings, 0 replies; 17+ messages in thread From: Paul E. McKenney @ 2024-03-19 21:32 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer, Sebastian Andrzej Siewior On Tue, Mar 19, 2024 at 01:44:40PM -0700, Yan Zhai wrote: > When there are heavy load, cpumap kernel threads can be busy polling > packets from redirect queues and block out RCU tasks from reaching > quiescent states. It is insufficient to just call cond_resched() in such > context. Periodically raise a consolidated RCU QS before cond_resched > fixes the problem. > > Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. 
McKenney <paulmck@kernel.org> > --- > kernel/bpf/cpumap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c > index 9ee8da477465..a8e34416e960 100644 > --- a/kernel/bpf/cpumap.c > +++ b/kernel/bpf/cpumap.c > @@ -263,6 +263,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames, > static int cpu_map_kthread_run(void *data) > { > struct bpf_cpu_map_entry *rcpu = data; > + unsigned long last_qs = jiffies; > > complete(&rcpu->kthread_running); > set_current_state(TASK_INTERRUPTIBLE); > @@ -288,10 +289,12 @@ static int cpu_map_kthread_run(void *data) > if (__ptr_ring_empty(rcpu->queue)) { > schedule(); > sched = 1; > + last_qs = jiffies; > } else { > __set_current_state(TASK_RUNNING); > } > } else { > + rcu_softirq_qs_periodic(last_qs); > sched = cond_resched(); > } > > -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 0/3] Report RCU QS for busy network kthreads 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai ` (2 preceding siblings ...) 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai @ 2024-03-20 10:30 ` Jesper Dangaard Brouer 2024-03-21 4:40 ` patchwork-bot+netdevbpf 4 siblings, 0 replies; 17+ messages in thread From: Jesper Dangaard Brouer @ 2024-03-20 10:30 UTC (permalink / raw) To: Yan Zhai, netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Sebastian Andrzej Siewior On 19/03/2024 21.44, Yan Zhai wrote: > This changeset fixes a common problem for busy networking kthreads. > These threads, e.g. NAPI threads, typically will do: > > * polling a batch of packets > * if there are more work, call cond_resched() to allow scheduling > * continue to poll more packets when rx queue is not empty > > We observed this being a problem in production, since it can block RCU > tasks from making progress under heavy load. Investigation indicates > that just calling cond_resched() is insufficient for RCU tasks to reach > quiescent states. This also has the side effect of frequently clearing > the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result, > schedule() will not be called in these circumstances, despite schedule() > in fact provides required quiescent states. This at least affects NAPI > threads, napi_busy_loop, and also cpumap kthread. > > By reporting RCU QSes in these kthreads periodically before cond_resched, the > blocked RCU waiters can correctly progress. 
Instead of just reporting QS for > RCU tasks, these code share the same concern as noted in the commit > d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe"). > So report a consolidated QS for safety. > > It is worth noting that, although this problem is reproducible in > napi_busy_loop, it only shows up when setting the polling interval to as high > as 2ms, which is far larger than recommended 50us-100us in the documentation. > So napi_busy_loop is left untouched. > > Lastly, this does not affect RT kernels, which does not enter the scheduler > through cond_resched(). Without the mentioned side effect, schedule() will > be called time by time, and clear the RCU task holdouts. > > V4: https://lore.kernel.org/bpf/cover.1710525524.git.yan@cloudflare.com/ > V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/ > V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ > V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t > > changes since v4: > * polished comments and docs for the RCU helper as Paul McKenney suggested > > changes since v3: > * fixed kernel-doc errors > > changes since v2: > * created a helper in rcu header to abstract the behavior > * fixed cpumap kthread in addition > > changes since v1: > * disable preemption first as Paul McKenney suggested > > Yan Zhai (3): > rcu: add a helper to report consolidated flavor QS > net: report RCU QS on threaded NAPI repolling > bpf: report RCU QS in cpumap kthread > > include/linux/rcupdate.h | 31 +++++++++++++++++++++++++++++++ > kernel/bpf/cpumap.c | 3 +++ > net/core/dev.c | 3 +++ > 3 files changed, 37 insertions(+) > Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5 net 0/3] Report RCU QS for busy network kthreads 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai ` (3 preceding siblings ...) 2024-03-20 10:30 ` [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Jesper Dangaard Brouer @ 2024-03-21 4:40 ` patchwork-bot+netdevbpf 4 siblings, 0 replies; 17+ messages in thread From: patchwork-bot+netdevbpf @ 2024-03-21 4:40 UTC (permalink / raw) To: Yan Zhai Cc: netdev, davem, edumazet, kuba, pabeni, jiri, horms, daniel, lorenzo, lixiaoyan, weiwan, alexanderduyck, linux-kernel, rcu, bpf, kernel-team, joel, paulmck, toke, alexei.starovoitov, rostedt, mark.rutland, hawk, bigeasy Hello: This series was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Tue, 19 Mar 2024 13:44:30 -0700 you wrote: > This changeset fixes a common problem for busy networking kthreads. > These threads, e.g. NAPI threads, typically will do: > > * polling a batch of packets > * if there are more work, call cond_resched() to allow scheduling > * continue to poll more packets when rx queue is not empty > > [...] Here is the summary with links: - [v5,net,1/3] rcu: add a helper to report consolidated flavor QS https://git.kernel.org/netdev/net/c/1a77557d48cf - [v5,net,2/3] net: report RCU QS on threaded NAPI repolling https://git.kernel.org/netdev/net/c/d6dbbb11247c - [v5,net,3/3] bpf: report RCU QS in cpumap kthread https://git.kernel.org/netdev/net/c/00bf63122459 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-04-05 18:13 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-03-19 20:44 [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Yan Zhai 2024-03-19 20:44 ` [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai 2024-03-19 21:31 ` Paul E. McKenney 2024-03-19 22:00 ` Yan Zhai 2024-03-19 22:08 ` Paul E. McKenney 2024-03-22 11:24 ` Sebastian Andrzej Siewior 2024-03-22 21:30 ` Paul E. McKenney 2024-03-23 2:02 ` Yan Zhai 2024-03-23 23:53 ` Paul E. McKenney 2024-04-05 13:49 ` Sebastian Andrzej Siewior 2024-04-05 18:13 ` Paul E. McKenney 2024-03-19 20:44 ` [PATCH v5 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-19 20:44 ` [PATCH v5 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai 2024-03-19 21:32 ` Paul E. McKenney 2024-03-20 10:30 ` [PATCH v5 net 0/3] Report RCU QS for busy network kthreads Jesper Dangaard Brouer 2024-03-21 4:40 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).