[PATCH v3 0/7] Reduce synchronize

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3)
@ 2023-11-28  8:00 Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 1/7] rcu: Reduce synchronize_rcu() latency Uladzislau Rezki (Sony)
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

This is V3.

V2 -> V3:
 - use dummy wait node to eliminate cache misses during a full llist
   scan to obtain a tail node;
 - add CONFIG_RCU_SR_NORMAL_DEBUG_GP Kconfig for a GP incomplete debug;
 - move synchronize_rcu()'s main control data under the rcu_state struct;
 - add rcutree.rcu_normal_wake_from_gp parameter description to the
   kernel-parameters.txt file;
 - split functionality into several patches.

V2: https://lore.kernel.org/all/20231030131254.488186-1-urezki@gmail.com/T/
V1: https://lore.kernel.org/lkml/20231025140915.590390-1-urezki@gmail.com/T/

Neeraj Upadhyay (1):
  rcu: Improve handling of synchronize_rcu() users

Uladzislau Rezki (Sony) (6):
  rcu: Reduce synchronize_rcu() latency
  rcu: Add a trace event for synchronize_rcu_normal()
  doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt
  rcu: Support direct wake-up of synchronize_rcu() users
  rcu: Move sync related data to rcu_state structure
  rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP

 .../admin-guide/kernel-parameters.txt         |  14 +
 include/trace/events/rcu.h                    |  27 ++
 kernel/rcu/Kconfig.debug                      |  12 +
 kernel/rcu/tree.c                             | 354 +++++++++++++++++-
 kernel/rcu/tree.h                             |  19 +
 kernel/rcu/tree_exp.h                         |   2 +-
 6 files changed, 426 insertions(+), 2 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 1/7] rcu: Reduce synchronize_rcu() latency
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 2/7] rcu: Add a trace event for synchronize_rcu_normal() Uladzislau Rezki (Sony)
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

A call to a synchronize_rcu() can be optimized from a latency
point of view. Workloads which depend on this can benefit of it.

The delay of wakeme_after_rcu() callback, which unblocks a waiter,
depends on several factors:

- how fast a process of offloading is started. Combination of:
    - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
    - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
    - other.
- when started, invoking path is interrupted due to:
    - time limit;
    - need_resched();
    - if limit is reached.
- where in a nocb list it is located;
- how fast previous callbacks completed;

Example:

1. On our embedded devices i can easily trigger the scenario when
it is a last in the list out of ~3600 callbacks:

<snip>
  <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
  <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
  <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

2. We use cpuset/cgroup to classify tasks and assign them into
different cgroups. For example "backgrond" group which binds tasks
only to little CPUs or "foreground" which makes use of all CPUs.
Tasks can be migrated between groups by a request if an acceleration
is needed.

See below an example how "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup which
allows to run only on little cores. In order to speed it up it
can be temporary moved into "foreground" cgroup which allows
to use big/all CPUs:

cgroup_attach_task():
 -> cgroup_migrate_execute()
   -> cpuset_can_attach()
     -> percpu_down_write()
       -> rcu_sync_enter()
         -> synchronize_rcu()
   -> now move tasks to the new cgroup.
 -> cgroup_migrate_finish()

<snip>
         rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
    PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
   TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>

"Boosting a task" depends on synchronize_rcu() latency:

- first trace shows a completion of synchronize_rcu();
- second shows attaching a task to a new group;
- last shows a final step when migration occurs.

3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. After completion of a grace period
users are deferred to a dedicated worker to process requests.

4. This patch reduces the latency of synchronize_rcu() approximately
by ~30-40% on synthetic tests. The real test case, camera launch time,
shows(time is in milliseconds):

1-run 542 vs 489 improvement 9%
2-run 540 vs 466 improvement 13%
3-run 518 vs 468 improvement 9%
4-run 531 vs 457 improvement 13%
5-run 548 vs 475 improvement 13%
6-run 509 vs 484 improvement 4%

Synthetic test(no "noise" from other callbacks):
Hardware: x86_64 64 CPUs, 64GB of memory
Linux-6.6

- 10K tasks(simultaneous);
- each task does(1000 loops)
     synchronize_rcu();
     kfree(p);

default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.

Running 60K gives approximately same results on my setup. Please note
it is without any interaction with another type of callbacks, otherwise
it will impact a lot a default case.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c     | 135 +++++++++++++++++++++++++++++++++++++++++-
 kernel/rcu/tree_exp.h |   2 +-
 2 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cb1caefa8bd0..57cfa467697b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,6 +1384,105 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+/*
+ * There are three lists for handling synchronize_rcu() users.
+ * A first list corresponds to new coming users, second for users
+ * which wait for a grace period and third is for which a grace
+ * period is passed.
+ */
+static struct sr_normal_state {
+	struct llist_head srs_next;	/* request a GP users. */
+	struct llist_head srs_wait;	/* wait for GP users. */
+	struct llist_head srs_done;	/* ready for GP users. */
+
+	/*
+	 * In order to add a batch of nodes to already
+	 * existing srs-done-list, a tail of srs-wait-list
+	 * is maintained.
+	 */
+	struct llist_node *srs_wait_tail;
+} sr;
+
+/* Disabled by default. */
+static int rcu_normal_wake_from_gp;
+module_param(rcu_normal_wake_from_gp, int, 0644);
+
+static void rcu_sr_normal_complete(struct llist_node *node)
+{
+	struct rcu_synchronize *rs = container_of(
+		(struct rcu_head *) node, struct rcu_synchronize, head);
+	unsigned long oldstate = (unsigned long) rs->head.func;
+
+	WARN_ONCE(!rcu_gp_is_expedited() && !poll_state_synchronize_rcu(oldstate),
+		"A full grace period is not passed yet: %lu",
+		rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
+
+	/* Finally. */
+	complete(&rs->completion);
+}
+
+static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
+{
+	struct llist_node *done, *rcu, *next;
+
+	done = llist_del_all(&sr.srs_done);
+	if (!done)
+		return;
+
+	llist_for_each_safe(rcu, next, done)
+		rcu_sr_normal_complete(rcu);
+}
+static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
+
+/*
+ * Helper function for rcu_gp_cleanup().
+ */
+static void rcu_sr_normal_gp_cleanup(void)
+{
+	struct llist_node *head, *tail;
+
+	if (llist_empty(&sr.srs_wait))
+		return;
+
+	tail = READ_ONCE(sr.srs_wait_tail);
+	head = __llist_del_all(&sr.srs_wait);
+
+	if (head) {
+		/* Can be not empty. */
+		llist_add_batch(head, tail, &sr.srs_done);
+		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
+	}
+}
+
+/*
+ * Helper function for rcu_gp_init().
+ */
+static void rcu_sr_normal_gp_init(void)
+{
+	struct llist_node *head, *tail;
+
+	if (llist_empty(&sr.srs_next))
+		return;
+
+	tail = llist_del_all(&sr.srs_next);
+	head = llist_reverse_order(tail);
+
+	/*
+	 * A waiting list of GP should be empty on this step,
+	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+	 * rolls it over. If not, it is a BUG, warn a user.
+	 */
+	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
+
+	WRITE_ONCE(sr.srs_wait_tail, tail);
+	__llist_add_batch(head, tail, &sr.srs_wait);
+}
+
+static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
+{
+	llist_add((struct llist_node *) &rs->head, &sr.srs_next);
+}
+
 /*
  * Initialize a new grace period.  Return false if no grace period required.
  */
@@ -1418,6 +1517,7 @@ static noinline_for_stack bool rcu_gp_init(void)
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
+	rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
@@ -1775,6 +1875,9 @@ static noinline void rcu_gp_cleanup(void)
 	}
 	raw_spin_unlock_irq_rcu_node(rnp);
 
+	// Make synchronize_rcu() users aware of the end of old grace period.
+	rcu_sr_normal_gp_cleanup();
+
 	// If strict, make all CPUs aware of the end of the old grace period.
 	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
@@ -3487,6 +3590,36 @@ static int rcu_blocking_is_gp(void)
 	return true;
 }
 
+/*
+ * Helper function for the synchronize_rcu() API.
+ */
+static void synchronize_rcu_normal(void)
+{
+	struct rcu_synchronize rs;
+
+	if (!READ_ONCE(rcu_normal_wake_from_gp)) {
+		wait_rcu_gp(call_rcu_hurry);
+		return;
+	}
+
+	init_rcu_head_on_stack(&rs.head);
+	init_completion(&rs.completion);
+
+	/*
+	 * This code might be preempted, therefore take a GP
+	 * snapshot before adding a request.
+	 */
+	rs.head.func = (void *) get_state_synchronize_rcu();
+	rcu_sr_normal_add_req(&rs);
+
+	/* Kick a GP and start waiting. */
+	(void) start_poll_synchronize_rcu();
+
+	/* Now we can wait. */
+	wait_for_completion(&rs.completion);
+	destroy_rcu_head_on_stack(&rs.head);
+}
+
 /**
  * synchronize_rcu - wait until a grace period has elapsed.
  *
@@ -3538,7 +3671,7 @@ void synchronize_rcu(void)
 		if (rcu_gp_is_expedited())
 			synchronize_rcu_expedited();
 		else
-			wait_rcu_gp(call_rcu_hurry);
+			synchronize_rcu_normal();
 		return;
 	}
 
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 8239b39d945b..53cc1d389f96 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -983,7 +983,7 @@ void synchronize_rcu_expedited(void)
 
 	/* If expedited grace periods are prohibited, fall back to normal. */
 	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu_hurry);
+		synchronize_rcu_normal();
 		return;
 	}
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 2/7] rcu: Add a trace event for synchronize_rcu_normal()
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 1/7] rcu: Reduce synchronize_rcu() latency Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt Uladzislau Rezki (Sony)
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

Add an rcu_sr_normal() trace event. It takes three arguments
first one is the name of RCU flavour, second one is a user id
which triggeres synchronize_rcu_normal() and last one is an
event.

There are two traces in the synchronize_rcu_normal(). On entry,
when a new request is registered and on exit point when request
is completed.

Please note, CONFIG_RCU_TRACE=y is required to activate traces.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/trace/events/rcu.h | 27 +++++++++++++++++++++++++++
 kernel/rcu/tree.c          |  7 ++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 2ef9c719772a..31b3e0d3e65f 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -707,6 +707,33 @@ TRACE_EVENT_RCU(rcu_invoke_kfree_bulk_callback,
 		__entry->rcuname, __entry->p, __entry->nr_records)
 );
 
+/*
+ * Tracepoint for a normal synchronize_rcu() states. The first argument
+ * is the RCU flavor, the second argument is a pointer to rcu_head the
+ * last one is an event.
+ */
+TRACE_EVENT_RCU(rcu_sr_normal,
+
+	TP_PROTO(const char *rcuname, struct rcu_head *rhp, const char *srevent),
+
+	TP_ARGS(rcuname, rhp, srevent),
+
+	TP_STRUCT__entry(
+		__field(const char *, rcuname)
+		__field(void *, rhp)
+		__field(const char *, srevent)
+	),
+
+	TP_fast_assign(
+		__entry->rcuname = rcuname;
+		__entry->rhp = rhp;
+		__entry->srevent = srevent;
+	),
+
+	TP_printk("%s rhp=0x%p event=%s",
+		__entry->rcuname, __entry->rhp, __entry->srevent)
+);
+
 /*
  * Tracepoint for exiting rcu_do_batch after RCU callbacks have been
  * invoked.  The first argument is the name of the RCU flavor,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 57cfa467697b..975621ef40e3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3597,9 +3597,11 @@ static void synchronize_rcu_normal(void)
 {
 	struct rcu_synchronize rs;
 
+	trace_rcu_sr_normal(rcu_state.name, &rs.head, TPS("request"));
+
 	if (!READ_ONCE(rcu_normal_wake_from_gp)) {
 		wait_rcu_gp(call_rcu_hurry);
-		return;
+		goto trace_complete_out;
 	}
 
 	init_rcu_head_on_stack(&rs.head);
@@ -3618,6 +3620,9 @@ static void synchronize_rcu_normal(void)
 	/* Now we can wait. */
 	wait_for_completion(&rs.completion);
 	destroy_rcu_head_on_stack(&rs.head);
+
+trace_complete_out:
+	trace_rcu_sr_normal(rcu_state.name, &rs.head, TPS("complete"));
 }
 
 /**
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 1/7] rcu: Reduce synchronize_rcu() latency Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 2/7] rcu: Add a trace event for synchronize_rcu_normal() Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-12-20  1:17   ` Paul E. McKenney
  2023-11-28  8:00 ` [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users Uladzislau Rezki (Sony)
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

This commit adds rcutree.rcu_normal_wake_from_gp description
to the kernel-parameters.txt file.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a1731a0f0ef..65bfbfb09522 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4944,6 +4944,20 @@
 			this kernel boot parameter, forcibly setting it
 			to zero.
 
+	rcutree.rcu_normal_wake_from_gp= [KNL]
+			Reduces a latency of synchronize_rcu() call. This approach
+			maintains its own track of synchronize_rcu() callers, so it
+			does not interact with regular callbacks because it does not
+			use a call_rcu[_hurry]() path. Please note, this is for a
+			normal grace period.
+
+			How to enable it:
+
+			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
+			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
+
+			Default is 0.
+
 	rcuscale.gp_async= [KNL]
 			Measure performance of asynchronous
 			grace-period primitives such as call_rcu().
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
                   ` (2 preceding siblings ...)
  2023-11-28  8:00 ` [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-12-20  1:37   ` Paul E. McKenney
  2023-11-28  8:00 ` [PATCH v3 5/7] rcu: Support direct wake-up " Uladzislau Rezki (Sony)
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>

Currently, processing of the next batch of rcu_synchronize nodes
for the new grace period, requires doing a llist reversal operation
to find the tail element of the list. This can be a very costly
operation (high number of cache misses) for a long list.

To address this, this patch introduces a "dummy-wait-node" entity.
At every grace period init, a new wait node is added to the llist.
This wait node is used as wait tail for this new grace period.

This allows lockless additions of new rcu_synchronize nodes in the
rcu_sr_normal_add_req(), while the cleanup work executes and does
the progress. The dummy nodes are removed on next round of cleanup
work execution.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
 kernel/rcu/tree.c | 270 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 233 insertions(+), 37 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 975621ef40e3..d7b48996825f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,25 +1384,173 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
+
+struct sr_wait_node {
+	atomic_t inuse;
+	struct llist_node node;
+};
+
 /*
- * There are three lists for handling synchronize_rcu() users.
- * A first list corresponds to new coming users, second for users
- * which wait for a grace period and third is for which a grace
- * period is passed.
+ * There is a single llist, which is used for handling
+ * synchronize_rcu() users' enqueued rcu_synchronize nodes.
+ * Within this llist, there are two tail pointers:
+ *
+ * wait tail: Tracks the set of nodes, which need to
+ *            wait for the current GP to complete.
+ * done tail: Tracks the set of nodes, for which grace
+ *            period has elapsed. These nodes processing
+ *            will be done as part of the cleanup work
+ *            execution by a kworker.
+ *
+ * At every grace period init, a new wait node is added
+ * to the llist. This wait node is used as wait tail
+ * for this new grace period. Given that there are a fixed
+ * number of wait nodes, if all wait nodes are in use
+ * (which can happen when kworker callback processing
+ * is delayed) and additional grace period is requested.
+ * This means, a system is slow in processing callbacks.
+ *
+ * TODO: If a slow processing is detected, a first node
+ * in the llist should be used as a wait-tail for this
+ * grace period, therefore users which should wait due
+ * to a slow process are handled by _this_ grace period
+ * and not next.
+ *
+ * Below is an illustration of how the done and wait
+ * tail pointers move from one set of rcu_synchronize nodes
+ * to the other, as grace periods start and finish and
+ * nodes are processed by kworker.
+ *
+ *
+ * a. Initial llist callbacks list:
+ *
+ * +----------+           +--------+          +-------+
+ * |          |           |        |          |       |
+ * |   head   |---------> |   cb2  |--------->| cb1   |
+ * |          |           |        |          |       |
+ * +----------+           +--------+          +-------+
+ *
+ *
+ *
+ * b. New GP1 Start:
+ *
+ *                    WAIT TAIL
+ *                      |
+ *                      |
+ *                      v
+ * +----------+     +--------+      +--------+        +-------+
+ * |          |     |        |      |        |        |       |
+ * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
+ * |          |     | head1  |      |        |        |       |
+ * +----------+     +--------+      +--------+        +-------+
+ *
+ *
+ *
+ * c. GP completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *
+ *                   DONE TAIL
+ *                     |
+ *                     |
+ *                     v
+ * +----------+     +--------+      +--------+        +-------+
+ * |          |     |        |      |        |        |       |
+ * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
+ * |          |     | head1  |      |        |        |       |
+ * +----------+     +--------+      +--------+        +-------+
+ *
+ *
+ *
+ * d. New callbacks and GP2 start:
+ *
+ *                    WAIT TAIL                          DONE TAIL
+ *                      |                                 |
+ *                      |                                 |
+ *                      v                                 v
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ * |          |     |      |    |      |    |      |    |     |    |     |    |     |
+ * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
+ * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ *
+ *
+ *
+ * e. GP2 completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *                   DONE TAIL
+ *                      |
+ *                      |
+ *                      v
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ * |          |     |      |    |      |    |      |    |     |    |     |    |     |
+ * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
+ * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ *
+ *
+ * While the llist state transitions from d to e, a kworker
+ * can start executing rcu_sr_normal_gp_cleanup_work() and
+ * can observe either the old done tail (@c) or the new
+ * done tail (@e). So, done tail updates and reads need
+ * to use the rel-acq semantics. If the concurrent kworker
+ * observes the old done tail, the newly queued work
+ * execution will process the updated done tail. If the
+ * concurrent kworker observes the new done tail, then
+ * the newly queued work will skip processing the done
+ * tail, as workqueue semantics guarantees that the new
+ * work is executed only after the previous one completes.
+ *
+ * f. kworker callbacks processing complete:
+ *
+ *
+ *                   DONE TAIL
+ *                     |
+ *                     |
+ *                     v
+ * +----------+     +--------+
+ * |          |     |        |
+ * |   head   ------> wait   |
+ * |          |     | head2  |
+ * +----------+     +--------+
+ *
  */
 static struct sr_normal_state {
 	struct llist_head srs_next;	/* request a GP users. */
-	struct llist_head srs_wait;	/* wait for GP users. */
-	struct llist_head srs_done;	/* ready for GP users. */
-
-	/*
-	 * In order to add a batch of nodes to already
-	 * existing srs-done-list, a tail of srs-wait-list
-	 * is maintained.
-	 */
-	struct llist_node *srs_wait_tail;
+	struct llist_node *srs_wait_tail; /* wait for GP users. */
+	struct llist_node *srs_done_tail; /* ready for GP users. */
+	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
 } sr;
 
+static bool rcu_sr_is_wait_head(struct llist_node *node)
+{
+	return &(sr.srs_wait_nodes)[0].node <= node &&
+		node <= &(sr.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
+}
+
+static struct llist_node *rcu_sr_get_wait_head(void)
+{
+	struct sr_wait_node *sr_wn;
+	int i;
+
+	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
+		sr_wn = &(sr.srs_wait_nodes)[i];
+
+		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
+			return &sr_wn->node;
+	}
+
+	return NULL;
+}
+
+static void rcu_sr_put_wait_head(struct llist_node *node)
+{
+	struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
+	atomic_set_release(&sr_wn->inuse, 0);
+}
+
 /* Disabled by default. */
 static int rcu_normal_wake_from_gp;
 module_param(rcu_normal_wake_from_gp, int, 0644);
@@ -1423,14 +1571,44 @@ static void rcu_sr_normal_complete(struct llist_node *node)
 
 static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 {
-	struct llist_node *done, *rcu, *next;
+	struct llist_node *done, *rcu, *next, *head;
 
-	done = llist_del_all(&sr.srs_done);
+	/*
+	 * This work execution can potentially execute
+	 * while a new done tail is being updated by
+	 * grace period kthread in rcu_sr_normal_gp_cleanup().
+	 * So, read and updates of done tail need to
+	 * follow acq-rel semantics.
+	 *
+	 * Given that wq semantics guarantees that a single work
+	 * cannot execute concurrently by multiple kworkers,
+	 * the done tail list manipulations are protected here.
+	 */
+	done = smp_load_acquire(&sr.srs_done_tail);
 	if (!done)
 		return;
 
-	llist_for_each_safe(rcu, next, done)
-		rcu_sr_normal_complete(rcu);
+	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
+	head = done->next;
+	done->next = NULL;
+
+	/*
+	 * The dummy node, which is pointed to by the
+	 * done tail which is acq-read above is not removed
+	 * here.  This allows lockless additions of new
+	 * rcu_synchronize nodes in rcu_sr_normal_add_req(),
+	 * while the cleanup work executes. The dummy
+	 * nodes is removed, in next round of cleanup
+	 * work execution.
+	 */
+	llist_for_each_safe(rcu, next, head) {
+		if (!rcu_sr_is_wait_head(rcu)) {
+			rcu_sr_normal_complete(rcu);
+			continue;
+		}
+
+		rcu_sr_put_wait_head(rcu);
+	}
 }
 static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
 
@@ -1439,43 +1617,56 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
  */
 static void rcu_sr_normal_gp_cleanup(void)
 {
-	struct llist_node *head, *tail;
+	struct llist_node *wait_tail;
 
-	if (llist_empty(&sr.srs_wait))
+	wait_tail = sr.srs_wait_tail;
+	if (wait_tail == NULL)
 		return;
 
-	tail = READ_ONCE(sr.srs_wait_tail);
-	head = __llist_del_all(&sr.srs_wait);
+	sr.srs_wait_tail = NULL;
+	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
 
-	if (head) {
-		/* Can be not empty. */
-		llist_add_batch(head, tail, &sr.srs_done);
+	// concurrent sr_normal_gp_cleanup work might observe this update.
+	smp_store_release(&sr.srs_done_tail, wait_tail);
+	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
+
+	if (wait_tail)
 		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
-	}
 }
 
 /*
  * Helper function for rcu_gp_init().
  */
-static void rcu_sr_normal_gp_init(void)
+static bool rcu_sr_normal_gp_init(void)
 {
-	struct llist_node *head, *tail;
+	struct llist_node *first;
+	struct llist_node *wait_head;
+	bool start_new_poll = false;
 
-	if (llist_empty(&sr.srs_next))
-		return;
+	first = READ_ONCE(sr.srs_next.first);
+	if (!first || rcu_sr_is_wait_head(first))
+		return start_new_poll;
+
+	wait_head = rcu_sr_get_wait_head();
+	if (!wait_head) {
+		// Kick another GP to retry.
+		start_new_poll = true;
+		return start_new_poll;
+	}
 
-	tail = llist_del_all(&sr.srs_next);
-	head = llist_reverse_order(tail);
+	/* Inject a wait-dummy-node. */
+	llist_add(wait_head, &sr.srs_next);
 
 	/*
-	 * A waiting list of GP should be empty on this step,
-	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+	 * A waiting list of rcu_synchronize nodes should be empty on
+	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
 	 * rolls it over. If not, it is a BUG, warn a user.
 	 */
-	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
+	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
+	sr.srs_wait_tail = wait_head;
+	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
 
-	WRITE_ONCE(sr.srs_wait_tail, tail);
-	__llist_add_batch(head, tail, &sr.srs_wait);
+	return start_new_poll;
 }
 
 static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
@@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
 	unsigned long mask;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root();
+	bool start_new_poll;
 
 	WRITE_ONCE(rcu_state.gp_activity, jiffies);
 	raw_spin_lock_irq_rcu_node(rnp);
@@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
-	rcu_sr_normal_gp_init();
+	start_new_poll = rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
 
+	// New poll request after rnp unlock
+	if (start_new_poll)
+		(void) start_poll_synchronize_rcu();
+
 	/*
 	 * Apply per-leaf buffered online and offline operations to
 	 * the rcu_node tree. Note that this new grace period need not
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 5/7] rcu: Support direct wake-up of synchronize_rcu() users
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
                   ` (3 preceding siblings ...)
  2023-11-28  8:00 ` [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-12-20  1:46   ` Paul E. McKenney
  2023-11-28  8:00 ` [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure Uladzislau Rezki (Sony)
  2023-11-28  8:00 ` [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP Uladzislau Rezki (Sony)
  6 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

This patch introduces a small enhancement which allows to do a
direct wake-up of synchronize_rcu() callers. It occurs after a
completion of grace period, thus by the gp-kthread.

Number of clients is limited by the hard-coded maximum allowed
threshold. The remaining part, if still exists is deferred to
a main worker.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d7b48996825f..69663a6d5050 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,6 +1384,12 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+/*
+ * A max threshold for synchronize_rcu() users which are
+ * awaken directly by the rcu_gp_kthread(). Left part is
+ * deferred to the main worker.
+ */
+#define SR_MAX_USERS_WAKE_FROM_GP 5
 #define SR_NORMAL_GP_WAIT_HEAD_MAX 5
 
 struct sr_wait_node {
@@ -1617,7 +1623,8 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
  */
 static void rcu_sr_normal_gp_cleanup(void)
 {
-	struct llist_node *wait_tail;
+	struct llist_node *wait_tail, *head, *rcu;
+	int done = 0;
 
 	wait_tail = sr.srs_wait_tail;
 	if (wait_tail == NULL)
@@ -1626,11 +1633,39 @@ static void rcu_sr_normal_gp_cleanup(void)
 	sr.srs_wait_tail = NULL;
 	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
 
+	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
+	head = wait_tail->next;
+
+	/*
+	 * Process (a) and (d) cases. See an illustration. Apart of
+	 * that it handles the scenario when all clients are done,
+	 * wait-head is released if last. The worker is not kicked.
+	 */
+	llist_for_each_safe(rcu, head, head) {
+		if (rcu_sr_is_wait_head(rcu)) {
+			if (!rcu->next) {
+				rcu_sr_put_wait_head(rcu);
+				wait_tail->next = NULL;
+			} else {
+				wait_tail->next = rcu;
+			}
+
+			break;
+		}
+
+		rcu_sr_normal_complete(rcu);
+		// It can be last, update a next on this step.
+		wait_tail->next = head;
+
+		if (++done == SR_MAX_USERS_WAKE_FROM_GP)
+			break;
+	}
+
 	// concurrent sr_normal_gp_cleanup work might observe this update.
 	smp_store_release(&sr.srs_done_tail, wait_tail);
 	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
 
-	if (wait_tail)
+	if (wait_tail->next)
 		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
                   ` (4 preceding siblings ...)
  2023-11-28  8:00 ` [PATCH v3 5/7] rcu: Support direct wake-up " Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-12-20  1:47   ` Paul E. McKenney
  2023-11-28  8:00 ` [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP Uladzislau Rezki (Sony)
  6 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

Move synchronize_rcu() main control data under the rcu_state
structure. An access is done via "rcu_state" global variable.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 50 ++++++++++++++---------------------------------
 kernel/rcu/tree.h | 19 ++++++++++++++++++
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 69663a6d5050..c0d3e46730e8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,19 +1384,6 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
-/*
- * A max threshold for synchronize_rcu() users which are
- * awaken directly by the rcu_gp_kthread(). Left part is
- * deferred to the main worker.
- */
-#define SR_MAX_USERS_WAKE_FROM_GP 5
-#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
-
-struct sr_wait_node {
-	atomic_t inuse;
-	struct llist_node node;
-};
-
 /*
  * There is a single llist, which is used for handling
  * synchronize_rcu() users' enqueued rcu_synchronize nodes.
@@ -1523,17 +1510,10 @@ struct sr_wait_node {
  * +----------+     +--------+
  *
  */
-static struct sr_normal_state {
-	struct llist_head srs_next;	/* request a GP users. */
-	struct llist_node *srs_wait_tail; /* wait for GP users. */
-	struct llist_node *srs_done_tail; /* ready for GP users. */
-	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
-} sr;
-
 static bool rcu_sr_is_wait_head(struct llist_node *node)
 {
-	return &(sr.srs_wait_nodes)[0].node <= node &&
-		node <= &(sr.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
+	return &(rcu_state.srs_wait_nodes)[0].node <= node &&
+		node <= &(rcu_state.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
 }
 
 static struct llist_node *rcu_sr_get_wait_head(void)
@@ -1542,7 +1522,7 @@ static struct llist_node *rcu_sr_get_wait_head(void)
 	int i;
 
 	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
-		sr_wn = &(sr.srs_wait_nodes)[i];
+		sr_wn = &(rcu_state.srs_wait_nodes)[i];
 
 		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
 			return &sr_wn->node;
@@ -1590,7 +1570,7 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 	 * cannot execute concurrently by multiple kworkers,
 	 * the done tail list manipulations are protected here.
 	 */
-	done = smp_load_acquire(&sr.srs_done_tail);
+	done = smp_load_acquire(&rcu_state.srs_done_tail);
 	if (!done)
 		return;
 
@@ -1626,12 +1606,12 @@ static void rcu_sr_normal_gp_cleanup(void)
 	struct llist_node *wait_tail, *head, *rcu;
 	int done = 0;
 
-	wait_tail = sr.srs_wait_tail;
+	wait_tail = rcu_state.srs_wait_tail;
 	if (wait_tail == NULL)
 		return;
 
-	sr.srs_wait_tail = NULL;
-	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
+	rcu_state.srs_wait_tail = NULL;
+	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
 
 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
 	head = wait_tail->next;
@@ -1662,8 +1642,8 @@ static void rcu_sr_normal_gp_cleanup(void)
 	}
 
 	// concurrent sr_normal_gp_cleanup work might observe this update.
-	smp_store_release(&sr.srs_done_tail, wait_tail);
-	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
+	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
 
 	if (wait_tail->next)
 		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
@@ -1678,7 +1658,7 @@ static bool rcu_sr_normal_gp_init(void)
 	struct llist_node *wait_head;
 	bool start_new_poll = false;
 
-	first = READ_ONCE(sr.srs_next.first);
+	first = READ_ONCE(rcu_state.srs_next.first);
 	if (!first || rcu_sr_is_wait_head(first))
 		return start_new_poll;
 
@@ -1690,23 +1670,23 @@ static bool rcu_sr_normal_gp_init(void)
 	}
 
 	/* Inject a wait-dummy-node. */
-	llist_add(wait_head, &sr.srs_next);
+	llist_add(wait_head, &rcu_state.srs_next);
 
 	/*
 	 * A waiting list of rcu_synchronize nodes should be empty on
 	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
 	 * rolls it over. If not, it is a BUG, warn a user.
 	 */
-	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
-	sr.srs_wait_tail = wait_head;
-	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
+	WARN_ON_ONCE(rcu_state.srs_wait_tail != NULL);
+	rcu_state.srs_wait_tail = wait_head;
+	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
 
 	return start_new_poll;
 }
 
 static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
 {
-	llist_add((struct llist_node *) &rs->head, &sr.srs_next);
+	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 192536916f9a..f72166b5067a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -316,6 +316,19 @@ do {									\
 	__set_current_state(TASK_RUNNING);				\
 } while (0)
 
+/*
+ * A max threshold for synchronize_rcu() users which are
+ * awaken directly by the rcu_gp_kthread(). Left part is
+ * deferred to the main worker.
+ */
+#define SR_MAX_USERS_WAKE_FROM_GP 5
+#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
+
+struct sr_wait_node {
+	atomic_t inuse;
+	struct llist_node node;
+};
+
 /*
  * RCU global state, including node hierarchy.  This hierarchy is
  * represented in "heap" form in a dense array.  The root (first level)
@@ -397,6 +410,12 @@ struct rcu_state {
 						/* Synchronize offline with */
 						/*  GP pre-initialization. */
 	int nocb_is_setup;			/* nocb is setup from boot */
+
+	/* synchronize_rcu() part. */
+	struct llist_head srs_next;	/* request a GP users. */
+	struct llist_node *srs_wait_tail; /* wait for GP users. */
+	struct llist_node *srs_done_tail; /* ready for GP users. */
+	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
 };
 
 /* Values for rcu_state structure's gp_flags field. */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP
  2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
                   ` (5 preceding siblings ...)
  2023-11-28  8:00 ` [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure Uladzislau Rezki (Sony)
@ 2023-11-28  8:00 ` Uladzislau Rezki (Sony)
  2023-12-20  1:14   ` Paul E. McKenney
  6 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-11-28  8:00 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Uladzislau Rezki, Oleksiy Avramchenko, Frederic Weisbecker

This option enables additional debugging for detecting a grace
period incompletion for synchronize_rcu() users. If a GP is not
fully passed for any user, the warning message is emitted.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/Kconfig.debug | 12 ++++++++++++
 kernel/rcu/tree.c        |  7 +++++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 2984de629f74..3d44106ca1f0 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -143,4 +143,16 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.
 
+config RCU_SR_NORMAL_DEBUG_GP
+	bool "Debug synchronize_rcu() callers for a grace period completion"
+	depends on DEBUG_KERNEL && RCU_EXPERT
+	default n
+	help
+	  This option enables additional debugging for detecting a grace
+	  period incompletion for synchronize_rcu() users. If a GP is not
+	  fully passed for any user, the warning message is emitted.
+
+	  Say Y here if you want to enable such debugging
+	  Say N if you are unsure.
+
 endmenu # "RCU Debugging"
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c0d3e46730e8..421bce4b8dd7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1547,7 +1547,8 @@ static void rcu_sr_normal_complete(struct llist_node *node)
 		(struct rcu_head *) node, struct rcu_synchronize, head);
 	unsigned long oldstate = (unsigned long) rs->head.func;
 
-	WARN_ONCE(!rcu_gp_is_expedited() && !poll_state_synchronize_rcu(oldstate),
+	WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
+		!poll_state_synchronize_rcu(oldstate),
 		"A full grace period is not passed yet: %lu",
 		rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
 
@@ -3822,7 +3823,9 @@ static void synchronize_rcu_normal(void)
 	 * This code might be preempted, therefore take a GP
 	 * snapshot before adding a request.
 	 */
-	rs.head.func = (void *) get_state_synchronize_rcu();
+	if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
+		rs.head.func = (void *) get_state_synchronize_rcu();
+
 	rcu_sr_normal_add_req(&rs);
 
 	/* Kick a GP and start waiting. */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP
  2023-11-28  8:00 ` [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP Uladzislau Rezki (Sony)
@ 2023-12-20  1:14   ` Paul E. McKenney
  2023-12-21 10:27     ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-20  1:14 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Nov 28, 2023 at 09:00:33AM +0100, Uladzislau Rezki (Sony) wrote:
> This option enables additional debugging for detecting a grace
> period incompletion for synchronize_rcu() users. If a GP is not
> fully passed for any user, the warning message is emitted.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Much better, thank you, as this avoids the possibility of false positives
in production.  But to avoid potential bisection issues, could you please
fold this into the first patch?

							Thanx, Paul

> ---
>  kernel/rcu/Kconfig.debug | 12 ++++++++++++
>  kernel/rcu/tree.c        |  7 +++++--
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
> index 2984de629f74..3d44106ca1f0 100644
> --- a/kernel/rcu/Kconfig.debug
> +++ b/kernel/rcu/Kconfig.debug
> @@ -143,4 +143,16 @@ config RCU_STRICT_GRACE_PERIOD
>  	  when looking for certain types of RCU usage bugs, for example,
>  	  too-short RCU read-side critical sections.
>  
> +config RCU_SR_NORMAL_DEBUG_GP
> +	bool "Debug synchronize_rcu() callers for a grace period completion"
> +	depends on DEBUG_KERNEL && RCU_EXPERT
> +	default n
> +	help
> +	  This option enables additional debugging for detecting a grace
> +	  period incompletion for synchronize_rcu() users. If a GP is not
> +	  fully passed for any user, the warning message is emitted.
> +
> +	  Say Y here if you want to enable such debugging
> +	  Say N if you are unsure.
> +
>  endmenu # "RCU Debugging"
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c0d3e46730e8..421bce4b8dd7 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1547,7 +1547,8 @@ static void rcu_sr_normal_complete(struct llist_node *node)
>  		(struct rcu_head *) node, struct rcu_synchronize, head);
>  	unsigned long oldstate = (unsigned long) rs->head.func;
>  
> -	WARN_ONCE(!rcu_gp_is_expedited() && !poll_state_synchronize_rcu(oldstate),
> +	WARN_ONCE(IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP) &&
> +		!poll_state_synchronize_rcu(oldstate),
>  		"A full grace period is not passed yet: %lu",
>  		rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
>  
> @@ -3822,7 +3823,9 @@ static void synchronize_rcu_normal(void)
>  	 * This code might be preempted, therefore take a GP
>  	 * snapshot before adding a request.
>  	 */
> -	rs.head.func = (void *) get_state_synchronize_rcu();
> +	if (IS_ENABLED(CONFIG_RCU_SR_NORMAL_DEBUG_GP))
> +		rs.head.func = (void *) get_state_synchronize_rcu();
> +
>  	rcu_sr_normal_add_req(&rs);
>  
>  	/* Kick a GP and start waiting. */
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt
  2023-11-28  8:00 ` [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt Uladzislau Rezki (Sony)
@ 2023-12-20  1:17   ` Paul E. McKenney
  2023-12-21 10:28     ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-20  1:17 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Nov 28, 2023 at 09:00:29AM +0100, Uladzislau Rezki (Sony) wrote:
> This commit adds rcutree.rcu_normal_wake_from_gp description
> to the kernel-parameters.txt file.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Very good, but could you please fold this into the first patch, which
is the one that adds this kernel parameter?

							Thanx, Paul

> ---
>  Documentation/admin-guide/kernel-parameters.txt | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0a1731a0f0ef..65bfbfb09522 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4944,6 +4944,20 @@
>  			this kernel boot parameter, forcibly setting it
>  			to zero.
>  
> +	rcutree.rcu_normal_wake_from_gp= [KNL]
> +			Reduces a latency of synchronize_rcu() call. This approach
> +			maintains its own track of synchronize_rcu() callers, so it
> +			does not interact with regular callbacks because it does not
> +			use a call_rcu[_hurry]() path. Please note, this is for a
> +			normal grace period.
> +
> +			How to enable it:
> +
> +			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> +			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
> +
> +			Default is 0.
> +
>  	rcuscale.gp_async= [KNL]
>  			Measure performance of asynchronous
>  			grace-period primitives such as call_rcu().
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-11-28  8:00 ` [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users Uladzislau Rezki (Sony)
@ 2023-12-20  1:37   ` Paul E. McKenney
  2023-12-21 10:52     ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-20  1:37 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> 
> Currently, processing of the next batch of rcu_synchronize nodes
> for the new grace period, requires doing a llist reversal operation
> to find the tail element of the list. This can be a very costly
> operation (high number of cache misses) for a long list.
> 
> To address this, this patch introduces a "dummy-wait-node" entity.
> At every grace period init, a new wait node is added to the llist.
> This wait node is used as wait tail for this new grace period.
> 
> This allows lockless additions of new rcu_synchronize nodes in the
> rcu_sr_normal_add_req(), while the cleanup work executes and does
> the progress. The dummy nodes are removed on next round of cleanup
> work execution.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>

This says that Uladzislau created the patch and that Neeraj
acted as maintainer.  I am guessing that you both worked on it,
in which case is should have the Co-developed-by tags as shown in
Documentation/process/submitting-patches.rst.  Could you please update
these to reflect the actual origin?

One question below toward the end.  There are probably others that I
should be asking, but I have to start somewhere.  ;-)

						Thanx, Paul

> ---
>  kernel/rcu/tree.c | 270 +++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 233 insertions(+), 37 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 975621ef40e3..d7b48996825f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1384,25 +1384,173 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
>  
> +#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
> +
> +struct sr_wait_node {
> +	atomic_t inuse;
> +	struct llist_node node;
> +};
> +
>  /*
> - * There are three lists for handling synchronize_rcu() users.
> - * A first list corresponds to new coming users, second for users
> - * which wait for a grace period and third is for which a grace
> - * period is passed.
> + * There is a single llist, which is used for handling
> + * synchronize_rcu() users' enqueued rcu_synchronize nodes.
> + * Within this llist, there are two tail pointers:
> + *
> + * wait tail: Tracks the set of nodes, which need to
> + *            wait for the current GP to complete.
> + * done tail: Tracks the set of nodes, for which grace
> + *            period has elapsed. These nodes processing
> + *            will be done as part of the cleanup work
> + *            execution by a kworker.
> + *
> + * At every grace period init, a new wait node is added
> + * to the llist. This wait node is used as wait tail
> + * for this new grace period. Given that there are a fixed
> + * number of wait nodes, if all wait nodes are in use
> + * (which can happen when kworker callback processing
> + * is delayed) and additional grace period is requested.
> + * This means, a system is slow in processing callbacks.
> + *
> + * TODO: If a slow processing is detected, a first node
> + * in the llist should be used as a wait-tail for this
> + * grace period, therefore users which should wait due
> + * to a slow process are handled by _this_ grace period
> + * and not next.
> + *
> + * Below is an illustration of how the done and wait
> + * tail pointers move from one set of rcu_synchronize nodes
> + * to the other, as grace periods start and finish and
> + * nodes are processed by kworker.
> + *
> + *
> + * a. Initial llist callbacks list:
> + *
> + * +----------+           +--------+          +-------+
> + * |          |           |        |          |       |
> + * |   head   |---------> |   cb2  |--------->| cb1   |
> + * |          |           |        |          |       |
> + * +----------+           +--------+          +-------+
> + *
> + *
> + *
> + * b. New GP1 Start:
> + *
> + *                    WAIT TAIL
> + *                      |
> + *                      |
> + *                      v
> + * +----------+     +--------+      +--------+        +-------+
> + * |          |     |        |      |        |        |       |
> + * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
> + * |          |     | head1  |      |        |        |       |
> + * +----------+     +--------+      +--------+        +-------+
> + *
> + *
> + *
> + * c. GP completion:
> + *
> + * WAIT_TAIL == DONE_TAIL
> + *
> + *                   DONE TAIL
> + *                     |
> + *                     |
> + *                     v
> + * +----------+     +--------+      +--------+        +-------+
> + * |          |     |        |      |        |        |       |
> + * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
> + * |          |     | head1  |      |        |        |       |
> + * +----------+     +--------+      +--------+        +-------+
> + *
> + *
> + *
> + * d. New callbacks and GP2 start:
> + *
> + *                    WAIT TAIL                          DONE TAIL
> + *                      |                                 |
> + *                      |                                 |
> + *                      v                                 v
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + * |          |     |      |    |      |    |      |    |     |    |     |    |     |
> + * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
> + * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + *
> + *
> + *
> + * e. GP2 completion:
> + *
> + * WAIT_TAIL == DONE_TAIL
> + *                   DONE TAIL
> + *                      |
> + *                      |
> + *                      v
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + * |          |     |      |    |      |    |      |    |     |    |     |    |     |
> + * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
> + * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + *
> + *
> + * While the llist state transitions from d to e, a kworker
> + * can start executing rcu_sr_normal_gp_cleanup_work() and
> + * can observe either the old done tail (@c) or the new
> + * done tail (@e). So, done tail updates and reads need
> + * to use the rel-acq semantics. If the concurrent kworker
> + * observes the old done tail, the newly queued work
> + * execution will process the updated done tail. If the
> + * concurrent kworker observes the new done tail, then
> + * the newly queued work will skip processing the done
> + * tail, as workqueue semantics guarantees that the new
> + * work is executed only after the previous one completes.
> + *
> + * f. kworker callbacks processing complete:
> + *
> + *
> + *                   DONE TAIL
> + *                     |
> + *                     |
> + *                     v
> + * +----------+     +--------+
> + * |          |     |        |
> + * |   head   ------> wait   |
> + * |          |     | head2  |
> + * +----------+     +--------+
> + *
>   */
>  static struct sr_normal_state {
>  	struct llist_head srs_next;	/* request a GP users. */
> -	struct llist_head srs_wait;	/* wait for GP users. */
> -	struct llist_head srs_done;	/* ready for GP users. */
> -
> -	/*
> -	 * In order to add a batch of nodes to already
> -	 * existing srs-done-list, a tail of srs-wait-list
> -	 * is maintained.
> -	 */
> -	struct llist_node *srs_wait_tail;
> +	struct llist_node *srs_wait_tail; /* wait for GP users. */
> +	struct llist_node *srs_done_tail; /* ready for GP users. */
> +	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
>  } sr;
>  
> +static bool rcu_sr_is_wait_head(struct llist_node *node)
> +{
> +	return &(sr.srs_wait_nodes)[0].node <= node &&
> +		node <= &(sr.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
> +}
> +
> +static struct llist_node *rcu_sr_get_wait_head(void)
> +{
> +	struct sr_wait_node *sr_wn;
> +	int i;
> +
> +	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
> +		sr_wn = &(sr.srs_wait_nodes)[i];
> +
> +		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
> +			return &sr_wn->node;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void rcu_sr_put_wait_head(struct llist_node *node)
> +{
> +	struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
> +	atomic_set_release(&sr_wn->inuse, 0);
> +}
> +
>  /* Disabled by default. */
>  static int rcu_normal_wake_from_gp;
>  module_param(rcu_normal_wake_from_gp, int, 0644);
> @@ -1423,14 +1571,44 @@ static void rcu_sr_normal_complete(struct llist_node *node)
>  
>  static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
>  {
> -	struct llist_node *done, *rcu, *next;
> +	struct llist_node *done, *rcu, *next, *head;
>  
> -	done = llist_del_all(&sr.srs_done);
> +	/*
> +	 * This work execution can potentially execute
> +	 * while a new done tail is being updated by
> +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> +	 * So, read and updates of done tail need to
> +	 * follow acq-rel semantics.
> +	 *
> +	 * Given that wq semantics guarantees that a single work
> +	 * cannot execute concurrently by multiple kworkers,
> +	 * the done tail list manipulations are protected here.
> +	 */
> +	done = smp_load_acquire(&sr.srs_done_tail);
>  	if (!done)
>  		return;
>  
> -	llist_for_each_safe(rcu, next, done)
> -		rcu_sr_normal_complete(rcu);
> +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> +	head = done->next;
> +	done->next = NULL;
> +
> +	/*
> +	 * The dummy node, which is pointed to by the
> +	 * done tail which is acq-read above is not removed
> +	 * here.  This allows lockless additions of new
> +	 * rcu_synchronize nodes in rcu_sr_normal_add_req(),
> +	 * while the cleanup work executes. The dummy
> +	 * nodes is removed, in next round of cleanup
> +	 * work execution.
> +	 */
> +	llist_for_each_safe(rcu, next, head) {
> +		if (!rcu_sr_is_wait_head(rcu)) {
> +			rcu_sr_normal_complete(rcu);
> +			continue;
> +		}
> +
> +		rcu_sr_put_wait_head(rcu);
> +	}
>  }
>  static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>  
> @@ -1439,43 +1617,56 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>   */
>  static void rcu_sr_normal_gp_cleanup(void)
>  {
> -	struct llist_node *head, *tail;
> +	struct llist_node *wait_tail;
>  
> -	if (llist_empty(&sr.srs_wait))
> +	wait_tail = sr.srs_wait_tail;
> +	if (wait_tail == NULL)
>  		return;
>  
> -	tail = READ_ONCE(sr.srs_wait_tail);
> -	head = __llist_del_all(&sr.srs_wait);
> +	sr.srs_wait_tail = NULL;
> +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
>  
> -	if (head) {
> -		/* Can be not empty. */
> -		llist_add_batch(head, tail, &sr.srs_done);
> +	// concurrent sr_normal_gp_cleanup work might observe this update.
> +	smp_store_release(&sr.srs_done_tail, wait_tail);
> +	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
> +
> +	if (wait_tail)
>  		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> -	}
>  }
>  
>  /*
>   * Helper function for rcu_gp_init().
>   */
> -static void rcu_sr_normal_gp_init(void)
> +static bool rcu_sr_normal_gp_init(void)
>  {
> -	struct llist_node *head, *tail;
> +	struct llist_node *first;
> +	struct llist_node *wait_head;
> +	bool start_new_poll = false;
>  
> -	if (llist_empty(&sr.srs_next))
> -		return;
> +	first = READ_ONCE(sr.srs_next.first);
> +	if (!first || rcu_sr_is_wait_head(first))
> +		return start_new_poll;
> +
> +	wait_head = rcu_sr_get_wait_head();
> +	if (!wait_head) {
> +		// Kick another GP to retry.
> +		start_new_poll = true;
> +		return start_new_poll;
> +	}
>  
> -	tail = llist_del_all(&sr.srs_next);
> -	head = llist_reverse_order(tail);
> +	/* Inject a wait-dummy-node. */
> +	llist_add(wait_head, &sr.srs_next);
>  
>  	/*
> -	 * A waiting list of GP should be empty on this step,
> -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> +	 * A waiting list of rcu_synchronize nodes should be empty on
> +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
>  	 * rolls it over. If not, it is a BUG, warn a user.
>  	 */
> -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> +	sr.srs_wait_tail = wait_head;
> +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
>  
> -	WRITE_ONCE(sr.srs_wait_tail, tail);
> -	__llist_add_batch(head, tail, &sr.srs_wait);
> +	return start_new_poll;
>  }
>  
>  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
>  	unsigned long mask;
>  	struct rcu_data *rdp;
>  	struct rcu_node *rnp = rcu_get_root();
> +	bool start_new_poll;
>  
>  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
>  	raw_spin_lock_irq_rcu_node(rnp);
> @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
>  	/* Record GP times before starting GP, hence rcu_seq_start(). */
>  	rcu_seq_start(&rcu_state.gp_seq);
>  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> -	rcu_sr_normal_gp_init();
> +	start_new_poll = rcu_sr_normal_gp_init();
>  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
>  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
>  	raw_spin_unlock_irq_rcu_node(rnp);
>  
> +	// New poll request after rnp unlock
> +	if (start_new_poll)
> +		(void) start_poll_synchronize_rcu();

You lost me on this one.  Anything that got moved to the wait list
should be handled by the current grace period, right?  Or is the
problem that rcu_sr_normal_gp_init() is being invoked after the call
to rcu_seq_start()?  If that is the case, could it be moved ahead so
that we don't need the extra grace period?

Or am I missing something subtle here?

> +
>  	/*
>  	 * Apply per-leaf buffered online and offline operations to
>  	 * the rcu_node tree. Note that this new grace period need not
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 5/7] rcu: Support direct wake-up of synchronize_rcu() users
  2023-11-28  8:00 ` [PATCH v3 5/7] rcu: Support direct wake-up " Uladzislau Rezki (Sony)
@ 2023-12-20  1:46   ` Paul E. McKenney
  2023-12-21 11:22     ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-20  1:46 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Nov 28, 2023 at 09:00:31AM +0100, Uladzislau Rezki (Sony) wrote:
> This patch introduces a small enhancement which allows to do a
> direct wake-up of synchronize_rcu() callers. It occurs after a
> completion of grace period, thus by the gp-kthread.
> 
> Number of clients is limited by the hard-coded maximum allowed
> threshold. The remaining part, if still exists is deferred to
> a main worker.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Nice optimization!

One question below.

> ---
>  kernel/rcu/tree.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d7b48996825f..69663a6d5050 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1384,6 +1384,12 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
>  
> +/*
> + * A max threshold for synchronize_rcu() users which are
> + * awaken directly by the rcu_gp_kthread(). Left part is
> + * deferred to the main worker.
> + */
> +#define SR_MAX_USERS_WAKE_FROM_GP 5
>  #define SR_NORMAL_GP_WAIT_HEAD_MAX 5
>  
>  struct sr_wait_node {
> @@ -1617,7 +1623,8 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>   */
>  static void rcu_sr_normal_gp_cleanup(void)
>  {
> -	struct llist_node *wait_tail;
> +	struct llist_node *wait_tail, *head, *rcu;
> +	int done = 0;
>  
>  	wait_tail = sr.srs_wait_tail;
>  	if (wait_tail == NULL)
> @@ -1626,11 +1633,39 @@ static void rcu_sr_normal_gp_cleanup(void)
>  	sr.srs_wait_tail = NULL;
>  	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
>  
> +	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> +	head = wait_tail->next;
> +
> +	/*
> +	 * Process (a) and (d) cases. See an illustration. Apart of
> +	 * that it handles the scenario when all clients are done,
> +	 * wait-head is released if last. The worker is not kicked.
> +	 */
> +	llist_for_each_safe(rcu, head, head) {

This does appear to be a clever way to save eight bytes on the stack,
but is our stack space really so restricted?  We are being invoked from
the RCU GP kthread, which isn't using much stack, right?

If so, let's spend the extra local variable and spare the reader a
trip to the llist_for_each_safe() definition.

> +		if (rcu_sr_is_wait_head(rcu)) {
> +			if (!rcu->next) {
> +				rcu_sr_put_wait_head(rcu);
> +				wait_tail->next = NULL;
> +			} else {
> +				wait_tail->next = rcu;
> +			}
> +
> +			break;
> +		}
> +
> +		rcu_sr_normal_complete(rcu);
> +		// It can be last, update a next on this step.
> +		wait_tail->next = head;
> +
> +		if (++done == SR_MAX_USERS_WAKE_FROM_GP)
> +			break;
> +	}
> +
>  	// concurrent sr_normal_gp_cleanup work might observe this update.
>  	smp_store_release(&sr.srs_done_tail, wait_tail);
>  	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
>  
> -	if (wait_tail)
> +	if (wait_tail->next)
>  		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
>  }
>  
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure
  2023-11-28  8:00 ` [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure Uladzislau Rezki (Sony)
@ 2023-12-20  1:47   ` Paul E. McKenney
  2023-12-21 10:56     ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-20  1:47 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Nov 28, 2023 at 09:00:32AM +0100, Uladzislau Rezki (Sony) wrote:
> Move synchronize_rcu() main control data under the rcu_state
> structure. An access is done via "rcu_state" global variable.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Could we please fold this into the earlier patches?  Your future self
might thank me.  ;-)

							Thanx, Paul

> ---
>  kernel/rcu/tree.c | 50 ++++++++++++++---------------------------------
>  kernel/rcu/tree.h | 19 ++++++++++++++++++
>  2 files changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 69663a6d5050..c0d3e46730e8 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1384,19 +1384,6 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
>  
> -/*
> - * A max threshold for synchronize_rcu() users which are
> - * awaken directly by the rcu_gp_kthread(). Left part is
> - * deferred to the main worker.
> - */
> -#define SR_MAX_USERS_WAKE_FROM_GP 5
> -#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
> -
> -struct sr_wait_node {
> -	atomic_t inuse;
> -	struct llist_node node;
> -};
> -
>  /*
>   * There is a single llist, which is used for handling
>   * synchronize_rcu() users' enqueued rcu_synchronize nodes.
> @@ -1523,17 +1510,10 @@ struct sr_wait_node {
>   * +----------+     +--------+
>   *
>   */
> -static struct sr_normal_state {
> -	struct llist_head srs_next;	/* request a GP users. */
> -	struct llist_node *srs_wait_tail; /* wait for GP users. */
> -	struct llist_node *srs_done_tail; /* ready for GP users. */
> -	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
> -} sr;
> -
>  static bool rcu_sr_is_wait_head(struct llist_node *node)
>  {
> -	return &(sr.srs_wait_nodes)[0].node <= node &&
> -		node <= &(sr.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
> +	return &(rcu_state.srs_wait_nodes)[0].node <= node &&
> +		node <= &(rcu_state.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
>  }
>  
>  static struct llist_node *rcu_sr_get_wait_head(void)
> @@ -1542,7 +1522,7 @@ static struct llist_node *rcu_sr_get_wait_head(void)
>  	int i;
>  
>  	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
> -		sr_wn = &(sr.srs_wait_nodes)[i];
> +		sr_wn = &(rcu_state.srs_wait_nodes)[i];
>  
>  		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
>  			return &sr_wn->node;
> @@ -1590,7 +1570,7 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
>  	 * cannot execute concurrently by multiple kworkers,
>  	 * the done tail list manipulations are protected here.
>  	 */
> -	done = smp_load_acquire(&sr.srs_done_tail);
> +	done = smp_load_acquire(&rcu_state.srs_done_tail);
>  	if (!done)
>  		return;
>  
> @@ -1626,12 +1606,12 @@ static void rcu_sr_normal_gp_cleanup(void)
>  	struct llist_node *wait_tail, *head, *rcu;
>  	int done = 0;
>  
> -	wait_tail = sr.srs_wait_tail;
> +	wait_tail = rcu_state.srs_wait_tail;
>  	if (wait_tail == NULL)
>  		return;
>  
> -	sr.srs_wait_tail = NULL;
> -	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> +	rcu_state.srs_wait_tail = NULL;
> +	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
>  
>  	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>  	head = wait_tail->next;
> @@ -1662,8 +1642,8 @@ static void rcu_sr_normal_gp_cleanup(void)
>  	}
>  
>  	// concurrent sr_normal_gp_cleanup work might observe this update.
> -	smp_store_release(&sr.srs_done_tail, wait_tail);
> -	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
> +	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> +	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
>  
>  	if (wait_tail->next)
>  		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> @@ -1678,7 +1658,7 @@ static bool rcu_sr_normal_gp_init(void)
>  	struct llist_node *wait_head;
>  	bool start_new_poll = false;
>  
> -	first = READ_ONCE(sr.srs_next.first);
> +	first = READ_ONCE(rcu_state.srs_next.first);
>  	if (!first || rcu_sr_is_wait_head(first))
>  		return start_new_poll;
>  
> @@ -1690,23 +1670,23 @@ static bool rcu_sr_normal_gp_init(void)
>  	}
>  
>  	/* Inject a wait-dummy-node. */
> -	llist_add(wait_head, &sr.srs_next);
> +	llist_add(wait_head, &rcu_state.srs_next);
>  
>  	/*
>  	 * A waiting list of rcu_synchronize nodes should be empty on
>  	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
>  	 * rolls it over. If not, it is a BUG, warn a user.
>  	 */
> -	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> -	sr.srs_wait_tail = wait_head;
> -	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> +	WARN_ON_ONCE(rcu_state.srs_wait_tail != NULL);
> +	rcu_state.srs_wait_tail = wait_head;
> +	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
>  
>  	return start_new_poll;
>  }
>  
>  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
>  {
> -	llist_add((struct llist_node *) &rs->head, &sr.srs_next);
> +	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
>  }
>  
>  /*
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 192536916f9a..f72166b5067a 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -316,6 +316,19 @@ do {									\
>  	__set_current_state(TASK_RUNNING);				\
>  } while (0)
>  
> +/*
> + * A max threshold for synchronize_rcu() users which are
> + * awaken directly by the rcu_gp_kthread(). Left part is
> + * deferred to the main worker.
> + */
> +#define SR_MAX_USERS_WAKE_FROM_GP 5
> +#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
> +
> +struct sr_wait_node {
> +	atomic_t inuse;
> +	struct llist_node node;
> +};
> +
>  /*
>   * RCU global state, including node hierarchy.  This hierarchy is
>   * represented in "heap" form in a dense array.  The root (first level)
> @@ -397,6 +410,12 @@ struct rcu_state {
>  						/* Synchronize offline with */
>  						/*  GP pre-initialization. */
>  	int nocb_is_setup;			/* nocb is setup from boot */
> +
> +	/* synchronize_rcu() part. */
> +	struct llist_head srs_next;	/* request a GP users. */
> +	struct llist_node *srs_wait_tail; /* wait for GP users. */
> +	struct llist_node *srs_done_tail; /* ready for GP users. */
> +	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
>  };
>  
>  /* Values for rcu_state structure's gp_flags field. */
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP
  2023-12-20  1:14   ` Paul E. McKenney
@ 2023-12-21 10:27     ` Uladzislau Rezki
  0 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki @ 2023-12-21 10:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki (Sony), RCU, Neeraj upadhyay, Boqun Feng,
	Hillf Danton, Joel Fernandes, LKML, Oleksiy Avramchenko,
	Frederic Weisbecker

> On Tue, Nov 28, 2023 at 09:00:33AM +0100, Uladzislau Rezki (Sony) wrote:
> > This option enables additional debugging for detecting a grace
> > period incompletion for synchronize_rcu() users. If a GP is not
> > fully passed for any user, the warning message is emitted.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Much better, thank you, as this avoids the possibility of false positives
> in production.  But to avoid potential bisection issues, could you please
> fold this into the first patch?
> 
Makes sense. I will fold it into the first one!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt
  2023-12-20  1:17   ` Paul E. McKenney
@ 2023-12-21 10:28     ` Uladzislau Rezki
  0 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki @ 2023-12-21 10:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki (Sony), RCU, Neeraj upadhyay, Boqun Feng,
	Hillf Danton, Joel Fernandes, LKML, Oleksiy Avramchenko,
	Frederic Weisbecker

On Tue, Dec 19, 2023 at 05:17:19PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 09:00:29AM +0100, Uladzislau Rezki (Sony) wrote:
> > This commit adds rcutree.rcu_normal_wake_from_gp description
> > to the kernel-parameters.txt file.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Very good, but could you please fold this into the first patch, which
> is the one that adds this kernel parameter?
> 
No problem :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-12-20  1:37   ` Paul E. McKenney
@ 2023-12-21 10:52     ` Uladzislau Rezki
  2023-12-21 18:40       ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2023-12-21 10:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki (Sony), RCU, Neeraj upadhyay, Boqun Feng,
	Hillf Danton, Joel Fernandes, LKML, Oleksiy Avramchenko,
	Frederic Weisbecker

On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > 
> > Currently, processing of the next batch of rcu_synchronize nodes
> > for the new grace period, requires doing a llist reversal operation
> > to find the tail element of the list. This can be a very costly
> > operation (high number of cache misses) for a long list.
> > 
> > To address this, this patch introduces a "dummy-wait-node" entity.
> > At every grace period init, a new wait node is added to the llist.
> > This wait node is used as wait tail for this new grace period.
> > 
> > This allows lockless additions of new rcu_synchronize nodes in the
> > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > the progress. The dummy nodes are removed on next round of cleanup
> > work execution.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> 
> This says that Uladzislau created the patch and that Neeraj
> acted as maintainer.  I am guessing that you both worked on it,
> in which case is should have the Co-developed-by tags as shown in
> Documentation/process/submitting-patches.rst.  Could you please update
> these to reflect the actual origin?
> 
Right. We both worked on it. Neeraj is an author whereas i should mark
myself as a Co-developed-by. This is a correct way. Thank you for
pointing on it!

>
> One question below toward the end.  There are probably others that I
> should be asking, but I have to start somewhere.  ;-)
> 
Good :)

> >  
> >  /*
> >   * Helper function for rcu_gp_init().
> >   */
> > -static void rcu_sr_normal_gp_init(void)
> > +static bool rcu_sr_normal_gp_init(void)
> >  {
> > -	struct llist_node *head, *tail;
> > +	struct llist_node *first;
> > +	struct llist_node *wait_head;
> > +	bool start_new_poll = false;
> >  
> > -	if (llist_empty(&sr.srs_next))
> > -		return;
> > +	first = READ_ONCE(sr.srs_next.first);
> > +	if (!first || rcu_sr_is_wait_head(first))
> > +		return start_new_poll;
> > +
> > +	wait_head = rcu_sr_get_wait_head();
> > +	if (!wait_head) {
> > +		// Kick another GP to retry.
> > +		start_new_poll = true;
> > +		return start_new_poll;
> > +	}
> >  
> > -	tail = llist_del_all(&sr.srs_next);
> > -	head = llist_reverse_order(tail);
> > +	/* Inject a wait-dummy-node. */
> > +	llist_add(wait_head, &sr.srs_next);
> >  
> >  	/*
> > -	 * A waiting list of GP should be empty on this step,
> > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> >  	 * rolls it over. If not, it is a BUG, warn a user.
> >  	 */
> > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > +	sr.srs_wait_tail = wait_head;
> > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> >  
> > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > +	return start_new_poll;
> >  }
> >  
> >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> >  	unsigned long mask;
> >  	struct rcu_data *rdp;
> >  	struct rcu_node *rnp = rcu_get_root();
> > +	bool start_new_poll;
> >  
> >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> >  	raw_spin_lock_irq_rcu_node(rnp);
> > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> >  	rcu_seq_start(&rcu_state.gp_seq);
> >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > -	rcu_sr_normal_gp_init();
> > +	start_new_poll = rcu_sr_normal_gp_init();
> >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> >  	raw_spin_unlock_irq_rcu_node(rnp);
> >  
> > +	// New poll request after rnp unlock
> > +	if (start_new_poll)
> > +		(void) start_poll_synchronize_rcu();
> 
> You lost me on this one.  Anything that got moved to the wait list
> should be handled by the current grace period, right?  Or is the
> problem that rcu_sr_normal_gp_init() is being invoked after the call
> to rcu_seq_start()?  If that is the case, could it be moved ahead so
> that we don't need the extra grace period?
> 
> Or am I missing something subtle here?
> 
The problem is that, we are limited in number of "wait-heads" which we
add as a marker node for this/current grace period. If there are more clients
and there is no a wait-head available it means that a system, the deferred
kworker, is slow in processing callbacks, thus all wait-nodes are in use.

That is why we need an extra grace period. Basically to repeat our try one
more time, i.e. it might be that a current grace period is not able to handle
users due to the fact that a system is doing really slow, but this is rather
a corner case and is not a problem.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure
  2023-12-20  1:47   ` Paul E. McKenney
@ 2023-12-21 10:56     ` Uladzislau Rezki
  0 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki @ 2023-12-21 10:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki (Sony), RCU, Neeraj upadhyay, Boqun Feng,
	Hillf Danton, Joel Fernandes, LKML, Oleksiy Avramchenko,
	Frederic Weisbecker

On Tue, Dec 19, 2023 at 05:47:47PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 09:00:32AM +0100, Uladzislau Rezki (Sony) wrote:
> > Move synchronize_rcu() main control data under the rcu_state
> > structure. An access is done via "rcu_state" global variable.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Could we please fold this into the earlier patches?  Your future self
> might thank me.  ;-)
> 
OK. No problem, i will see which one is the most appropriate for this :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 5/7] rcu: Support direct wake-up of synchronize_rcu() users
  2023-12-20  1:46   ` Paul E. McKenney
@ 2023-12-21 11:22     ` Uladzislau Rezki
  0 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki @ 2023-12-21 11:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki (Sony), RCU, Neeraj upadhyay, Boqun Feng,
	Hillf Danton, Joel Fernandes, LKML, Oleksiy Avramchenko,
	Frederic Weisbecker

On Tue, Dec 19, 2023 at 05:46:11PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 09:00:31AM +0100, Uladzislau Rezki (Sony) wrote:
> > This patch introduces a small enhancement which allows to do a
> > direct wake-up of synchronize_rcu() callers. It occurs after a
> > completion of grace period, thus by the gp-kthread.
> > 
> > Number of clients is limited by the hard-coded maximum allowed
> > threshold. The remaining part, if still exists is deferred to
> > a main worker.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Nice optimization!
> 
> One question below.
> 
> > ---
> >  kernel/rcu/tree.c | 39 +++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 37 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d7b48996825f..69663a6d5050 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1384,6 +1384,12 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> >  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >  }
> >  
> > +/*
> > + * A max threshold for synchronize_rcu() users which are
> > + * awaken directly by the rcu_gp_kthread(). Left part is
> > + * deferred to the main worker.
> > + */
> > +#define SR_MAX_USERS_WAKE_FROM_GP 5
> >  #define SR_NORMAL_GP_WAIT_HEAD_MAX 5
> >  
> >  struct sr_wait_node {
> > @@ -1617,7 +1623,8 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
> >   */
> >  static void rcu_sr_normal_gp_cleanup(void)
> >  {
> > -	struct llist_node *wait_tail;
> > +	struct llist_node *wait_tail, *head, *rcu;
> > +	int done = 0;
> >  
> >  	wait_tail = sr.srs_wait_tail;
> >  	if (wait_tail == NULL)
> > @@ -1626,11 +1633,39 @@ static void rcu_sr_normal_gp_cleanup(void)
> >  	sr.srs_wait_tail = NULL;
> >  	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> >  
> > +	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> > +	head = wait_tail->next;
> > +
> > +	/*
> > +	 * Process (a) and (d) cases. See an illustration. Apart of
> > +	 * that it handles the scenario when all clients are done,
> > +	 * wait-head is released if last. The worker is not kicked.
> > +	 */
> > +	llist_for_each_safe(rcu, head, head) {
> 
> This does appear to be a clever way to save eight bytes on the stack,
> but is our stack space really so restricted?  We are being invoked from
> the RCU GP kthread, which isn't using much stack, right?
> 
> If so, let's spend the extra local variable and spare the reader a
> trip to the llist_for_each_safe() definition.
> 
OK, you mean to go with an extra "next" variable to use it in the
llist-loop. I will change it accordingly!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-12-21 10:52     ` Uladzislau Rezki
@ 2023-12-21 18:40       ` Paul E. McKenney
  2023-12-22  9:27         ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-21 18:40 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Thu, Dec 21, 2023 at 11:52:33AM +0100, Uladzislau Rezki wrote:
> On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > 
> > > Currently, processing of the next batch of rcu_synchronize nodes
> > > for the new grace period, requires doing a llist reversal operation
> > > to find the tail element of the list. This can be a very costly
> > > operation (high number of cache misses) for a long list.
> > > 
> > > To address this, this patch introduces a "dummy-wait-node" entity.
> > > At every grace period init, a new wait node is added to the llist.
> > > This wait node is used as wait tail for this new grace period.
> > > 
> > > This allows lockless additions of new rcu_synchronize nodes in the
> > > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > > the progress. The dummy nodes are removed on next round of cleanup
> > > work execution.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > 
> > This says that Uladzislau created the patch and that Neeraj
> > acted as maintainer.  I am guessing that you both worked on it,
> > in which case is should have the Co-developed-by tags as shown in
> > Documentation/process/submitting-patches.rst.  Could you please update
> > these to reflect the actual origin?
> > 
> Right. We both worked on it. Neeraj is an author whereas i should mark
> myself as a Co-developed-by. This is a correct way. Thank you for
> pointing on it!

Sounds good, thank you!

> > One question below toward the end.  There are probably others that I
> > should be asking, but I have to start somewhere.  ;-)
> > 
> Good :)
> 
> > >  
> > >  /*
> > >   * Helper function for rcu_gp_init().
> > >   */
> > > -static void rcu_sr_normal_gp_init(void)
> > > +static bool rcu_sr_normal_gp_init(void)
> > >  {
> > > -	struct llist_node *head, *tail;
> > > +	struct llist_node *first;
> > > +	struct llist_node *wait_head;
> > > +	bool start_new_poll = false;
> > >  
> > > -	if (llist_empty(&sr.srs_next))
> > > -		return;
> > > +	first = READ_ONCE(sr.srs_next.first);
> > > +	if (!first || rcu_sr_is_wait_head(first))
> > > +		return start_new_poll;
> > > +
> > > +	wait_head = rcu_sr_get_wait_head();
> > > +	if (!wait_head) {
> > > +		// Kick another GP to retry.
> > > +		start_new_poll = true;
> > > +		return start_new_poll;
> > > +	}
> > >  
> > > -	tail = llist_del_all(&sr.srs_next);
> > > -	head = llist_reverse_order(tail);
> > > +	/* Inject a wait-dummy-node. */
> > > +	llist_add(wait_head, &sr.srs_next);
> > >  
> > >  	/*
> > > -	 * A waiting list of GP should be empty on this step,
> > > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > >  	 * rolls it over. If not, it is a BUG, warn a user.
> > >  	 */
> > > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > > +	sr.srs_wait_tail = wait_head;
> > > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> > >  
> > > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > > +	return start_new_poll;
> > >  }
> > >  
> > >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	unsigned long mask;
> > >  	struct rcu_data *rdp;
> > >  	struct rcu_node *rnp = rcu_get_root();
> > > +	bool start_new_poll;
> > >  
> > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > >  	rcu_seq_start(&rcu_state.gp_seq);
> > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > -	rcu_sr_normal_gp_init();
> > > +	start_new_poll = rcu_sr_normal_gp_init();
> > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > >  
> > > +	// New poll request after rnp unlock
> > > +	if (start_new_poll)
> > > +		(void) start_poll_synchronize_rcu();
> > 
> > You lost me on this one.  Anything that got moved to the wait list
> > should be handled by the current grace period, right?  Or is the
> > problem that rcu_sr_normal_gp_init() is being invoked after the call
> > to rcu_seq_start()?  If that is the case, could it be moved ahead so
> > that we don't need the extra grace period?
> > 
> > Or am I missing something subtle here?
> > 
> The problem is that, we are limited in number of "wait-heads" which we
> add as a marker node for this/current grace period. If there are more clients
> and there is no a wait-head available it means that a system, the deferred
> kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> 
> That is why we need an extra grace period. Basically to repeat our try one
> more time, i.e. it might be that a current grace period is not able to handle
> users due to the fact that a system is doing really slow, but this is rather
> a corner case and is not a problem.

But in that case, the real issue is not the need for an extra grace
period, but rather the need for the wakeup processing to happen, correct?
Or am I missing something subtle here?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-12-21 18:40       ` Paul E. McKenney
@ 2023-12-22  9:27         ` Uladzislau Rezki
  2023-12-22 18:58           ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2023-12-22  9:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Thu, Dec 21, 2023 at 10:40:21AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 21, 2023 at 11:52:33AM +0100, Uladzislau Rezki wrote:
> > On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > > > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > > 
> > > > Currently, processing of the next batch of rcu_synchronize nodes
> > > > for the new grace period, requires doing a llist reversal operation
> > > > to find the tail element of the list. This can be a very costly
> > > > operation (high number of cache misses) for a long list.
> > > > 
> > > > To address this, this patch introduces a "dummy-wait-node" entity.
> > > > At every grace period init, a new wait node is added to the llist.
> > > > This wait node is used as wait tail for this new grace period.
> > > > 
> > > > This allows lockless additions of new rcu_synchronize nodes in the
> > > > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > > > the progress. The dummy nodes are removed on next round of cleanup
> > > > work execution.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > 
> > > This says that Uladzislau created the patch and that Neeraj
> > > acted as maintainer.  I am guessing that you both worked on it,
> > > in which case is should have the Co-developed-by tags as shown in
> > > Documentation/process/submitting-patches.rst.  Could you please update
> > > these to reflect the actual origin?
> > > 
> > Right. We both worked on it. Neeraj is an author whereas i should mark
> > myself as a Co-developed-by. This is a correct way. Thank you for
> > pointing on it!
> 
> Sounds good, thank you!
> 
> > > One question below toward the end.  There are probably others that I
> > > should be asking, but I have to start somewhere.  ;-)
> > > 
> > Good :)
> > 
> > > >  
> > > >  /*
> > > >   * Helper function for rcu_gp_init().
> > > >   */
> > > > -static void rcu_sr_normal_gp_init(void)
> > > > +static bool rcu_sr_normal_gp_init(void)
> > > >  {
> > > > -	struct llist_node *head, *tail;
> > > > +	struct llist_node *first;
> > > > +	struct llist_node *wait_head;
> > > > +	bool start_new_poll = false;
> > > >  
> > > > -	if (llist_empty(&sr.srs_next))
> > > > -		return;
> > > > +	first = READ_ONCE(sr.srs_next.first);
> > > > +	if (!first || rcu_sr_is_wait_head(first))
> > > > +		return start_new_poll;
> > > > +
> > > > +	wait_head = rcu_sr_get_wait_head();
> > > > +	if (!wait_head) {
> > > > +		// Kick another GP to retry.
> > > > +		start_new_poll = true;
> > > > +		return start_new_poll;
> > > > +	}
> > > >  
> > > > -	tail = llist_del_all(&sr.srs_next);
> > > > -	head = llist_reverse_order(tail);
> > > > +	/* Inject a wait-dummy-node. */
> > > > +	llist_add(wait_head, &sr.srs_next);
> > > >  
> > > >  	/*
> > > > -	 * A waiting list of GP should be empty on this step,
> > > > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > > > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > >  	 * rolls it over. If not, it is a BUG, warn a user.
> > > >  	 */
> > > > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > > > +	sr.srs_wait_tail = wait_head;
> > > > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> > > >  
> > > > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > > > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > > > +	return start_new_poll;
> > > >  }
> > > >  
> > > >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > >  	unsigned long mask;
> > > >  	struct rcu_data *rdp;
> > > >  	struct rcu_node *rnp = rcu_get_root();
> > > > +	bool start_new_poll;
> > > >  
> > > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > >  	rcu_seq_start(&rcu_state.gp_seq);
> > > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > -	rcu_sr_normal_gp_init();
> > > > +	start_new_poll = rcu_sr_normal_gp_init();
> > > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > > >  
> > > > +	// New poll request after rnp unlock
> > > > +	if (start_new_poll)
> > > > +		(void) start_poll_synchronize_rcu();
> > > 
> > > You lost me on this one.  Anything that got moved to the wait list
> > > should be handled by the current grace period, right?  Or is the
> > > problem that rcu_sr_normal_gp_init() is being invoked after the call
> > > to rcu_seq_start()?  If that is the case, could it be moved ahead so
> > > that we don't need the extra grace period?
> > > 
> > > Or am I missing something subtle here?
> > > 
> > The problem is that, we are limited in number of "wait-heads" which we
> > add as a marker node for this/current grace period. If there are more clients
> > and there is no a wait-head available it means that a system, the deferred
> > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > 
> > That is why we need an extra grace period. Basically to repeat our try one
> > more time, i.e. it might be that a current grace period is not able to handle
> > users due to the fact that a system is doing really slow, but this is rather
> > a corner case and is not a problem.
> 
> But in that case, the real issue is not the need for an extra grace
> period, but rather the need for the wakeup processing to happen, correct?
> Or am I missing something subtle here?
> 
Basically, yes. If we had a spare dummy-node we could process the users
by the current GP(no need in extra). Why we may not have it - it is because
like you pointed:

- wake-up issue, i.e. wake-up time + when we are on_cpu;
- slow list process. For example priority. The kworker is not
  given enough CPU time to do the progress, thus "dummy-nodes"
  are not released in time for reuse.

Therefore, en extra GP is requested if there is a high flow of
synchronize_rcu() users and kworker is not able to do a progress
in time.

For example 60K+ parallel synchronize_rcu() users will trigger it.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-12-22  9:27         ` Uladzislau Rezki
@ 2023-12-22 18:58           ` Paul E. McKenney
  2024-01-02 12:52             ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2023-12-22 18:58 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Fri, Dec 22, 2023 at 10:27:41AM +0100, Uladzislau Rezki wrote:
> On Thu, Dec 21, 2023 at 10:40:21AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 21, 2023 at 11:52:33AM +0100, Uladzislau Rezki wrote:
> > > On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > > > > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > > > 
> > > > > Currently, processing of the next batch of rcu_synchronize nodes
> > > > > for the new grace period, requires doing a llist reversal operation
> > > > > to find the tail element of the list. This can be a very costly
> > > > > operation (high number of cache misses) for a long list.
> > > > > 
> > > > > To address this, this patch introduces a "dummy-wait-node" entity.
> > > > > At every grace period init, a new wait node is added to the llist.
> > > > > This wait node is used as wait tail for this new grace period.
> > > > > 
> > > > > This allows lockless additions of new rcu_synchronize nodes in the
> > > > > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > > > > the progress. The dummy nodes are removed on next round of cleanup
> > > > > work execution.
> > > > > 
> > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > > 
> > > > This says that Uladzislau created the patch and that Neeraj
> > > > acted as maintainer.  I am guessing that you both worked on it,
> > > > in which case is should have the Co-developed-by tags as shown in
> > > > Documentation/process/submitting-patches.rst.  Could you please update
> > > > these to reflect the actual origin?
> > > > 
> > > Right. We both worked on it. Neeraj is an author whereas i should mark
> > > myself as a Co-developed-by. This is a correct way. Thank you for
> > > pointing on it!
> > 
> > Sounds good, thank you!
> > 
> > > > One question below toward the end.  There are probably others that I
> > > > should be asking, but I have to start somewhere.  ;-)
> > > > 
> > > Good :)
> > > 
> > > > >  
> > > > >  /*
> > > > >   * Helper function for rcu_gp_init().
> > > > >   */
> > > > > -static void rcu_sr_normal_gp_init(void)
> > > > > +static bool rcu_sr_normal_gp_init(void)
> > > > >  {
> > > > > -	struct llist_node *head, *tail;
> > > > > +	struct llist_node *first;
> > > > > +	struct llist_node *wait_head;
> > > > > +	bool start_new_poll = false;
> > > > >  
> > > > > -	if (llist_empty(&sr.srs_next))
> > > > > -		return;
> > > > > +	first = READ_ONCE(sr.srs_next.first);
> > > > > +	if (!first || rcu_sr_is_wait_head(first))
> > > > > +		return start_new_poll;
> > > > > +
> > > > > +	wait_head = rcu_sr_get_wait_head();
> > > > > +	if (!wait_head) {
> > > > > +		// Kick another GP to retry.
> > > > > +		start_new_poll = true;
> > > > > +		return start_new_poll;
> > > > > +	}
> > > > >  
> > > > > -	tail = llist_del_all(&sr.srs_next);
> > > > > -	head = llist_reverse_order(tail);
> > > > > +	/* Inject a wait-dummy-node. */
> > > > > +	llist_add(wait_head, &sr.srs_next);
> > > > >  
> > > > >  	/*
> > > > > -	 * A waiting list of GP should be empty on this step,
> > > > > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > > > > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > >  	 * rolls it over. If not, it is a BUG, warn a user.
> > > > >  	 */
> > > > > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > > > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > > > > +	sr.srs_wait_tail = wait_head;
> > > > > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> > > > >  
> > > > > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > > > > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > > > > +	return start_new_poll;
> > > > >  }
> > > > >  
> > > > >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > > >  	unsigned long mask;
> > > > >  	struct rcu_data *rdp;
> > > > >  	struct rcu_node *rnp = rcu_get_root();
> > > > > +	bool start_new_poll;
> > > > >  
> > > > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > > > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > > > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > >  	rcu_seq_start(&rcu_state.gp_seq);
> > > > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > -	rcu_sr_normal_gp_init();
> > > > > +	start_new_poll = rcu_sr_normal_gp_init();
> > > > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > > > >  
> > > > > +	// New poll request after rnp unlock
> > > > > +	if (start_new_poll)
> > > > > +		(void) start_poll_synchronize_rcu();
> > > > 
> > > > You lost me on this one.  Anything that got moved to the wait list
> > > > should be handled by the current grace period, right?  Or is the
> > > > problem that rcu_sr_normal_gp_init() is being invoked after the call
> > > > to rcu_seq_start()?  If that is the case, could it be moved ahead so
> > > > that we don't need the extra grace period?
> > > > 
> > > > Or am I missing something subtle here?
> > > > 
> > > The problem is that, we are limited in number of "wait-heads" which we
> > > add as a marker node for this/current grace period. If there are more clients
> > > and there is no a wait-head available it means that a system, the deferred
> > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > 
> > > That is why we need an extra grace period. Basically to repeat our try one
> > > more time, i.e. it might be that a current grace period is not able to handle
> > > users due to the fact that a system is doing really slow, but this is rather
> > > a corner case and is not a problem.
> > 
> > But in that case, the real issue is not the need for an extra grace
> > period, but rather the need for the wakeup processing to happen, correct?
> > Or am I missing something subtle here?
> > 
> Basically, yes. If we had a spare dummy-node we could process the users
> by the current GP(no need in extra). Why we may not have it - it is because
> like you pointed:
> 
> - wake-up issue, i.e. wake-up time + when we are on_cpu;
> - slow list process. For example priority. The kworker is not
>   given enough CPU time to do the progress, thus "dummy-nodes"
>   are not released in time for reuse.
> 
> Therefore, en extra GP is requested if there is a high flow of
> synchronize_rcu() users and kworker is not able to do a progress
> in time.
> 
> For example 60K+ parallel synchronize_rcu() users will trigger it.

OK, but what bad thing would happen if that was moved to precede the
rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
would be the same as the one that is just now starting.

Something like this?

	start_new_poll = rcu_sr_normal_gp_init();

	/* Record GP times before starting GP, hence rcu_seq_start(). */
	rcu_seq_start(&rcu_state.gp_seq);
	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);

	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
	raw_spin_unlock_irq_rcu_node(rnp);

	// New poll request after rnp unlock
	if (start_new_poll)
		(void) start_poll_synchronize_rcu();

Yes, rcu_sr_normal_gp_init() might need some adjustment given that it
is seeing the pre-GP value of rcu_state.gp_seq.

But unless I am missing something, what you have now can result in
extra grace periods, which incur overhead on what would otherwise be an
idle system.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2023-12-22 18:58           ` Paul E. McKenney
@ 2024-01-02 12:52             ` Uladzislau Rezki
  2024-01-02 19:25               ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2024-01-02 12:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

Hello, Paul!

Sorry for late answer, it is because of holidays :)

> > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > add as a marker node for this/current grace period. If there are more clients
> > > > and there is no a wait-head available it means that a system, the deferred
> > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > 
> > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > users due to the fact that a system is doing really slow, but this is rather
> > > > a corner case and is not a problem.
> > > 
> > > But in that case, the real issue is not the need for an extra grace
> > > period, but rather the need for the wakeup processing to happen, correct?
> > > Or am I missing something subtle here?
> > > 
> > Basically, yes. If we had a spare dummy-node we could process the users
> > by the current GP(no need in extra). Why we may not have it - it is because
> > like you pointed:
> > 
> > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > - slow list process. For example priority. The kworker is not
> >   given enough CPU time to do the progress, thus "dummy-nodes"
> >   are not released in time for reuse.
> > 
> > Therefore, en extra GP is requested if there is a high flow of
> > synchronize_rcu() users and kworker is not able to do a progress
> > in time.
> > 
> > For example 60K+ parallel synchronize_rcu() users will trigger it.
> 
> OK, but what bad thing would happen if that was moved to precede the
> rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> would be the same as the one that is just now starting.
> 
> Something like this?
> 
> 	start_new_poll = rcu_sr_normal_gp_init();
> 
> 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> 	rcu_seq_start(&rcu_state.gp_seq);
> 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
>
I had a concern about the case when rcu_sr_normal_gp_init() handles what
we currently have, in terms of requests. Right after that there is/are
extra sync requests which invoke the start_poll_synchronize_rcu() but
since a GP has been requested before it will not request an extra one. So
"last" incoming users might not be processed.

That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
updated.

I can miss something, so please comment. Apart of that we can move it
as you proposed.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-02 12:52             ` Uladzislau Rezki
@ 2024-01-02 19:25               ` Paul E. McKenney
  2024-01-03 13:16                 ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2024-01-02 19:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> Hello, Paul!
> 
> Sorry for late answer, it is because of holidays :)
> 
> > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > 
> > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > a corner case and is not a problem.
> > > > 
> > > > But in that case, the real issue is not the need for an extra grace
> > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > Or am I missing something subtle here?
> > > > 
> > > Basically, yes. If we had a spare dummy-node we could process the users
> > > by the current GP(no need in extra). Why we may not have it - it is because
> > > like you pointed:
> > > 
> > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > - slow list process. For example priority. The kworker is not
> > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > >   are not released in time for reuse.
> > > 
> > > Therefore, en extra GP is requested if there is a high flow of
> > > synchronize_rcu() users and kworker is not able to do a progress
> > > in time.
> > > 
> > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > 
> > OK, but what bad thing would happen if that was moved to precede the
> > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > would be the same as the one that is just now starting.
> > 
> > Something like this?
> > 
> > 	start_new_poll = rcu_sr_normal_gp_init();
> > 
> > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > 	rcu_seq_start(&rcu_state.gp_seq);
> > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> >
> I had a concern about the case when rcu_sr_normal_gp_init() handles what
> we currently have, in terms of requests. Right after that there is/are
> extra sync requests which invoke the start_poll_synchronize_rcu() but
> since a GP has been requested before it will not request an extra one. So
> "last" incoming users might not be processed.
> 
> That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> updated.
> 
> I can miss something, so please comment. Apart of that we can move it
> as you proposed.

Couldn't that possibility be handled by a check in rcu_gp_cleanup()?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-02 19:25               ` Paul E. McKenney
@ 2024-01-03 13:16                 ` Uladzislau Rezki
  2024-01-03 14:47                   ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2024-01-03 13:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > Hello, Paul!
> > 
> > Sorry for late answer, it is because of holidays :)
> > 
> > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > 
> > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > a corner case and is not a problem.
> > > > > 
> > > > > But in that case, the real issue is not the need for an extra grace
> > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > Or am I missing something subtle here?
> > > > > 
> > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > like you pointed:
> > > > 
> > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > - slow list process. For example priority. The kworker is not
> > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > >   are not released in time for reuse.
> > > > 
> > > > Therefore, en extra GP is requested if there is a high flow of
> > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > in time.
> > > > 
> > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > 
> > > OK, but what bad thing would happen if that was moved to precede the
> > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > would be the same as the one that is just now starting.
> > > 
> > > Something like this?
> > > 
> > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > 
> > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > >
> > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > we currently have, in terms of requests. Right after that there is/are
> > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > since a GP has been requested before it will not request an extra one. So
> > "last" incoming users might not be processed.
> > 
> > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > updated.
> > 
> > I can miss something, so please comment. Apart of that we can move it
> > as you proposed.
> 
> Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> 
It is controlled by the caller anyway, i.e. if a new GP is needed.

I am not 100% sure it is as straightforward as it could look like to
handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
that we need to access to the first element of llist and find out if
it is a wait-dummy-head or not. If not we know there are extra incoming
calls.

So that way requires extra calling of start_poll_synchronize_rcu().

I can add a comment about your concern and we can find the best approach
later, if it is OK with you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 13:16                 ` Uladzislau Rezki
@ 2024-01-03 14:47                   ` Paul E. McKenney
  2024-01-03 17:35                     ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2024-01-03 14:47 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > Hello, Paul!
> > > 
> > > Sorry for late answer, it is because of holidays :)
> > > 
> > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > 
> > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > a corner case and is not a problem.
> > > > > > 
> > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > Or am I missing something subtle here?
> > > > > > 
> > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > like you pointed:
> > > > > 
> > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > - slow list process. For example priority. The kworker is not
> > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > >   are not released in time for reuse.
> > > > > 
> > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > in time.
> > > > > 
> > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > 
> > > > OK, but what bad thing would happen if that was moved to precede the
> > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > would be the same as the one that is just now starting.
> > > > 
> > > > Something like this?
> > > > 
> > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > 
> > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > >
> > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > we currently have, in terms of requests. Right after that there is/are
> > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > since a GP has been requested before it will not request an extra one. So
> > > "last" incoming users might not be processed.
> > > 
> > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > updated.
> > > 
> > > I can miss something, so please comment. Apart of that we can move it
> > > as you proposed.
> > 
> > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > 
> It is controlled by the caller anyway, i.e. if a new GP is needed.
> 
> I am not 100% sure it is as straightforward as it could look like to
> handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> that we need to access to the first element of llist and find out if
> it is a wait-dummy-head or not. If not we know there are extra incoming
> calls.
> 
> So that way requires extra calling of start_poll_synchronize_rcu().

If this is invoked early enough in rcu_gp_cleanup(), all that needs to
happen is to set the need_gp flag.  Plus you can count the number of
requests, and snapshot that number at rcu_gp_init() time and check to
see if it changed at rcu_gp_cleanup() time.  Later on, this could be
used to reduce the number of wakeups, correct?

> I can add a comment about your concern and we can find the best approach
> later, if it is OK with you!

I agree that this should be added via a later patch, though I have not
yet given up on the possibility that this patch might be simple enough
to be later in this same series.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 14:47                   ` Paul E. McKenney
@ 2024-01-03 17:35                     ` Uladzislau Rezki
  2024-01-03 17:56                       ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2024-01-03 17:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > Hello, Paul!
> > > > 
> > > > Sorry for late answer, it is because of holidays :)
> > > > 
> > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > 
> > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > a corner case and is not a problem.
> > > > > > > 
> > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > Or am I missing something subtle here?
> > > > > > > 
> > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > like you pointed:
> > > > > > 
> > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > - slow list process. For example priority. The kworker is not
> > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > >   are not released in time for reuse.
> > > > > > 
> > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > in time.
> > > > > > 
> > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > 
> > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > would be the same as the one that is just now starting.
> > > > > 
> > > > > Something like this?
> > > > > 
> > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > 
> > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > >
> > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > we currently have, in terms of requests. Right after that there is/are
> > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > since a GP has been requested before it will not request an extra one. So
> > > > "last" incoming users might not be processed.
> > > > 
> > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > updated.
> > > > 
> > > > I can miss something, so please comment. Apart of that we can move it
> > > > as you proposed.
> > > 
> > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > 
> > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > 
> > I am not 100% sure it is as straightforward as it could look like to
> > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > that we need to access to the first element of llist and find out if
> > it is a wait-dummy-head or not. If not we know there are extra incoming
> > calls.
> > 
> > So that way requires extra calling of start_poll_synchronize_rcu().
> 
> If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> happen is to set the need_gp flag.  Plus you can count the number of
> requests, and snapshot that number at rcu_gp_init() time and check to
> see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> used to reduce the number of wakeups, correct?
> 
You mean instead of waking-up a gp-kthread just continue processing of
new users if they are exist? If so, i think, we can implement it as separate
patches.

> > I can add a comment about your concern and we can find the best approach
> > later, if it is OK with you!
> 
> I agree that this should be added via a later patch, though I have not
> yet given up on the possibility that this patch might be simple enough
> to be later in this same series.
> 
Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
function does not request any new gp, i.e. our approach does not do any extra GP
requests. It happens only if there are no any dummy-wait-head available as we
discussed it earlier.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 17:35                     ` Uladzislau Rezki
@ 2024-01-03 17:56                       ` Paul E. McKenney
  2024-01-03 19:02                         ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2024-01-03 17:56 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > Hello, Paul!
> > > > > 
> > > > > Sorry for late answer, it is because of holidays :)
> > > > > 
> > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > 
> > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > a corner case and is not a problem.
> > > > > > > > 
> > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > Or am I missing something subtle here?
> > > > > > > > 
> > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > like you pointed:
> > > > > > > 
> > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > >   are not released in time for reuse.
> > > > > > > 
> > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > in time.
> > > > > > > 
> > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > 
> > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > would be the same as the one that is just now starting.
> > > > > > 
> > > > > > Something like this?
> > > > > > 
> > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > 
> > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > >
> > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > since a GP has been requested before it will not request an extra one. So
> > > > > "last" incoming users might not be processed.
> > > > > 
> > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > updated.
> > > > > 
> > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > as you proposed.
> > > > 
> > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > 
> > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > 
> > > I am not 100% sure it is as straightforward as it could look like to
> > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > that we need to access to the first element of llist and find out if
> > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > calls.
> > > 
> > > So that way requires extra calling of start_poll_synchronize_rcu().
> > 
> > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > happen is to set the need_gp flag.  Plus you can count the number of
> > requests, and snapshot that number at rcu_gp_init() time and check to
> > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > used to reduce the number of wakeups, correct?
> > 
> You mean instead of waking-up a gp-kthread just continue processing of
> new users if they are exist? If so, i think, we can implement it as separate
> patches.

Agreed, this is an optimization, and thus should be a separate patch.

> > > I can add a comment about your concern and we can find the best approach
> > > later, if it is OK with you!
> > 
> > I agree that this should be added via a later patch, though I have not
> > yet given up on the possibility that this patch might be simple enough
> > to be later in this same series.
> > 
> Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> function does not request any new gp, i.e. our approach does not do any extra GP
> requests. It happens only if there are no any dummy-wait-head available as we
> discussed it earlier.

The start_poll_synchronize_rcu() added by your patch 4/7 will request
an additional grace period because it is invoked after rcu_seq_start()
is called, correct?  Or am I missing something subtle here?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 17:56                       ` Paul E. McKenney
@ 2024-01-03 19:02                         ` Uladzislau Rezki
  2024-01-03 19:03                           ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2024-01-03 19:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > Hello, Paul!
> > > > > > 
> > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > 
> > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > 
> > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > 
> > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > 
> > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > like you pointed:
> > > > > > > > 
> > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > >   are not released in time for reuse.
> > > > > > > > 
> > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > in time.
> > > > > > > > 
> > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > 
> > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > would be the same as the one that is just now starting.
> > > > > > > 
> > > > > > > Something like this?
> > > > > > > 
> > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > 
> > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > >
> > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > "last" incoming users might not be processed.
> > > > > > 
> > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > updated.
> > > > > > 
> > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > as you proposed.
> > > > > 
> > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > 
> > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > 
> > > > I am not 100% sure it is as straightforward as it could look like to
> > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > that we need to access to the first element of llist and find out if
> > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > calls.
> > > > 
> > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > 
> > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > happen is to set the need_gp flag.  Plus you can count the number of
> > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > used to reduce the number of wakeups, correct?
> > > 
> > You mean instead of waking-up a gp-kthread just continue processing of
> > new users if they are exist? If so, i think, we can implement it as separate
> > patches.
> 
> Agreed, this is an optimization, and thus should be a separate patch.
> 
> > > > I can add a comment about your concern and we can find the best approach
> > > > later, if it is OK with you!
> > > 
> > > I agree that this should be added via a later patch, though I have not
> > > yet given up on the possibility that this patch might be simple enough
> > > to be later in this same series.
> > > 
> > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > function does not request any new gp, i.e. our approach does not do any extra GP
> > requests. It happens only if there are no any dummy-wait-head available as we
> > discussed it earlier.
> 
> The start_poll_synchronize_rcu() added by your patch 4/7 will request
> an additional grace period because it is invoked after rcu_seq_start()
> is called, correct?  Or am I missing something subtle here?
> 
<snip>
+       // New poll request after rnp unlock
+       if (start_new_poll)
+               (void) start_poll_synchronize_rcu();
+
<snip>

The "start_new_poll" is set to "true" only when _this_ GP is not able
to handle anything and there are outstanding users. It happens when the
rcu_sr_normal_gp_init() function was not able to insert a dummy separator
to the llist, because there were no left dummy-nodes(fixed number of them)
due to the fact that all of them are "in-use". The reason why there are no
dummy-nodes is because of slow progress because it is done by dedicated
kworker.

I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
only when i run 20 000 processes concurrently in a tight loop:

<snip>
while (1)
  synchronize_rcu();
<snip>

in that scenario we start to ask for an addition GP because we are not up
to speed, i.e. a system is slow in processing callbacks and we need some
time until wait-node/nodes is/are released for reuse.

We need a next GP to move it forward, i.e. to repeat a try of attaching
a dummy-node.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 19:02                         ` Uladzislau Rezki
@ 2024-01-03 19:03                           ` Uladzislau Rezki
  2024-01-03 19:33                             ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Uladzislau Rezki @ 2024-01-03 19:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Paul E. McKenney, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 08:02:00PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > > Hello, Paul!
> > > > > > > 
> > > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > > 
> > > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > > 
> > > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > > 
> > > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > > 
> > > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > > like you pointed:
> > > > > > > > > 
> > > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > > >   are not released in time for reuse.
> > > > > > > > > 
> > > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > > in time.
> > > > > > > > > 
> > > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > > 
> > > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > > would be the same as the one that is just now starting.
> > > > > > > > 
> > > > > > > > Something like this?
> > > > > > > > 
> > > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > > 
> > > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > > >
> > > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > > "last" incoming users might not be processed.
> > > > > > > 
> > > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > > updated.
> > > > > > > 
> > > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > > as you proposed.
> > > > > > 
> > > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > > 
> > > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > > 
> > > > > I am not 100% sure it is as straightforward as it could look like to
> > > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > > that we need to access to the first element of llist and find out if
> > > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > > calls.
> > > > > 
> > > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > > 
> > > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > > happen is to set the need_gp flag.  Plus you can count the number of
> > > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > > used to reduce the number of wakeups, correct?
> > > > 
> > > You mean instead of waking-up a gp-kthread just continue processing of
> > > new users if they are exist? If so, i think, we can implement it as separate
> > > patches.
> > 
> > Agreed, this is an optimization, and thus should be a separate patch.
> > 
> > > > > I can add a comment about your concern and we can find the best approach
> > > > > later, if it is OK with you!
> > > > 
> > > > I agree that this should be added via a later patch, though I have not
> > > > yet given up on the possibility that this patch might be simple enough
> > > > to be later in this same series.
> > > > 
> > > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > > function does not request any new gp, i.e. our approach does not do any extra GP
> > > requests. It happens only if there are no any dummy-wait-head available as we
> > > discussed it earlier.
> > 
> > The start_poll_synchronize_rcu() added by your patch 4/7 will request
> > an additional grace period because it is invoked after rcu_seq_start()
> > is called, correct?  Or am I missing something subtle here?
> > 
> <snip>
> +       // New poll request after rnp unlock
> +       if (start_new_poll)
> +               (void) start_poll_synchronize_rcu();
> +
> <snip>
> 
> The "start_new_poll" is set to "true" only when _this_ GP is not able
> to handle anything and there are outstanding users. It happens when the
> rcu_sr_normal_gp_init() function was not able to insert a dummy separator
> to the llist, because there were no left dummy-nodes(fixed number of them)
> due to the fact that all of them are "in-use". The reason why there are no
> dummy-nodes is because of slow progress because it is done by dedicated
> kworker.
> 
> I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
> only when i run 20 000 processes concurrently in a tight loop:
> 
> <snip>
> while (1)
>   synchronize_rcu();
> <snip>
> 
> in that scenario we start to ask for an addition GP because we are not up
> to speed, i.e. a system is slow in processing callbacks and we need some
> time until wait-node/nodes is/are released for reuse.
> 
> We need a next GP to move it forward, i.e. to repeat a try of attaching
> a dummy-node.
> 
Probably i should add a comment about it :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 19:03                           ` Uladzislau Rezki
@ 2024-01-03 19:33                             ` Paul E. McKenney
  2024-01-04 11:17                               ` Uladzislau Rezki
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2024-01-03 19:33 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton, Joel Fernandes,
	LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 08:03:10PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 03, 2024 at 08:02:00PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > > > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > > > Hello, Paul!
> > > > > > > > 
> > > > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > > > 
> > > > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > > > 
> > > > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > > > 
> > > > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > > > 
> > > > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > > > like you pointed:
> > > > > > > > > > 
> > > > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > > > >   are not released in time for reuse.
> > > > > > > > > > 
> > > > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > > > in time.
> > > > > > > > > > 
> > > > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > > > 
> > > > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > > > would be the same as the one that is just now starting.
> > > > > > > > > 
> > > > > > > > > Something like this?
> > > > > > > > > 
> > > > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > > > 
> > > > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > > > >
> > > > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > > > "last" incoming users might not be processed.
> > > > > > > > 
> > > > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > > > updated.
> > > > > > > > 
> > > > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > > > as you proposed.
> > > > > > > 
> > > > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > > > 
> > > > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > > > 
> > > > > > I am not 100% sure it is as straightforward as it could look like to
> > > > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > > > that we need to access to the first element of llist and find out if
> > > > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > > > calls.
> > > > > > 
> > > > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > > > 
> > > > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > > > happen is to set the need_gp flag.  Plus you can count the number of
> > > > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > > > used to reduce the number of wakeups, correct?
> > > > > 
> > > > You mean instead of waking-up a gp-kthread just continue processing of
> > > > new users if they are exist? If so, i think, we can implement it as separate
> > > > patches.
> > > 
> > > Agreed, this is an optimization, and thus should be a separate patch.
> > > 
> > > > > > I can add a comment about your concern and we can find the best approach
> > > > > > later, if it is OK with you!
> > > > > 
> > > > > I agree that this should be added via a later patch, though I have not
> > > > > yet given up on the possibility that this patch might be simple enough
> > > > > to be later in this same series.
> > > > > 
> > > > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > > > function does not request any new gp, i.e. our approach does not do any extra GP
> > > > requests. It happens only if there are no any dummy-wait-head available as we
> > > > discussed it earlier.
> > > 
> > > The start_poll_synchronize_rcu() added by your patch 4/7 will request
> > > an additional grace period because it is invoked after rcu_seq_start()
> > > is called, correct?  Or am I missing something subtle here?
> > > 
> > <snip>
> > +       // New poll request after rnp unlock
> > +       if (start_new_poll)
> > +               (void) start_poll_synchronize_rcu();
> > +
> > <snip>
> > 
> > The "start_new_poll" is set to "true" only when _this_ GP is not able
> > to handle anything and there are outstanding users. It happens when the
> > rcu_sr_normal_gp_init() function was not able to insert a dummy separator
> > to the llist, because there were no left dummy-nodes(fixed number of them)
> > due to the fact that all of them are "in-use". The reason why there are no
> > dummy-nodes is because of slow progress because it is done by dedicated
> > kworker.
> > 
> > I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
> > only when i run 20 000 processes concurrently in a tight loop:
> > 
> > <snip>
> > while (1)
> >   synchronize_rcu();
> > <snip>
> > 
> > in that scenario we start to ask for an addition GP because we are not up
> > to speed, i.e. a system is slow in processing callbacks and we need some
> > time until wait-node/nodes is/are released for reuse.
> > 
> > We need a next GP to move it forward, i.e. to repeat a try of attaching
> > a dummy-node.
> > 
> Probably i should add a comment about it :)

Sounds good, and thank you for bearing with me!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users
  2024-01-03 19:33                             ` Paul E. McKenney
@ 2024-01-04 11:17                               ` Uladzislau Rezki
  0 siblings, 0 replies; 31+ messages in thread
From: Uladzislau Rezki @ 2024-01-04 11:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, RCU, Neeraj upadhyay, Boqun Feng, Hillf Danton,
	Joel Fernandes, LKML, Oleksiy Avramchenko, Frederic Weisbecker

On Wed, Jan 03, 2024 at 11:33:01AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 08:03:10PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 03, 2024 at 08:02:00PM +0100, Uladzislau Rezki wrote:
> > > On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > > > > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > > > > Hello, Paul!
> > > > > > > > > 
> > > > > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > > > > 
> > > > > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > > > > 
> > > > > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > > > > 
> > > > > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > > > > like you pointed:
> > > > > > > > > > > 
> > > > > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > > > > >   are not released in time for reuse.
> > > > > > > > > > > 
> > > > > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > > > > in time.
> > > > > > > > > > > 
> > > > > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > > > > 
> > > > > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > > > > would be the same as the one that is just now starting.
> > > > > > > > > > 
> > > > > > > > > > Something like this?
> > > > > > > > > > 
> > > > > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > > > > 
> > > > > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > > > > >
> > > > > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > > > > "last" incoming users might not be processed.
> > > > > > > > > 
> > > > > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > > > > updated.
> > > > > > > > > 
> > > > > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > > > > as you proposed.
> > > > > > > > 
> > > > > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > > > > 
> > > > > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > > > > 
> > > > > > > I am not 100% sure it is as straightforward as it could look like to
> > > > > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > > > > that we need to access to the first element of llist and find out if
> > > > > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > > > > calls.
> > > > > > > 
> > > > > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > > > > 
> > > > > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > > > > happen is to set the need_gp flag.  Plus you can count the number of
> > > > > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > > > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > > > > used to reduce the number of wakeups, correct?
> > > > > > 
> > > > > You mean instead of waking-up a gp-kthread just continue processing of
> > > > > new users if they are exist? If so, i think, we can implement it as separate
> > > > > patches.
> > > > 
> > > > Agreed, this is an optimization, and thus should be a separate patch.
> > > > 
> > > > > > > I can add a comment about your concern and we can find the best approach
> > > > > > > later, if it is OK with you!
> > > > > > 
> > > > > > I agree that this should be added via a later patch, though I have not
> > > > > > yet given up on the possibility that this patch might be simple enough
> > > > > > to be later in this same series.
> > > > > > 
> > > > > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > > > > function does not request any new gp, i.e. our approach does not do any extra GP
> > > > > requests. It happens only if there are no any dummy-wait-head available as we
> > > > > discussed it earlier.
> > > > 
> > > > The start_poll_synchronize_rcu() added by your patch 4/7 will request
> > > > an additional grace period because it is invoked after rcu_seq_start()
> > > > is called, correct?  Or am I missing something subtle here?
> > > > 
> > > <snip>
> > > +       // New poll request after rnp unlock
> > > +       if (start_new_poll)
> > > +               (void) start_poll_synchronize_rcu();
> > > +
> > > <snip>
> > > 
> > > The "start_new_poll" is set to "true" only when _this_ GP is not able
> > > to handle anything and there are outstanding users. It happens when the
> > > rcu_sr_normal_gp_init() function was not able to insert a dummy separator
> > > to the llist, because there were no left dummy-nodes(fixed number of them)
> > > due to the fact that all of them are "in-use". The reason why there are no
> > > dummy-nodes is because of slow progress because it is done by dedicated
> > > kworker.
> > > 
> > > I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
> > > only when i run 20 000 processes concurrently in a tight loop:
> > > 
> > > <snip>
> > > while (1)
> > >   synchronize_rcu();
> > > <snip>
> > > 
> > > in that scenario we start to ask for an addition GP because we are not up
> > > to speed, i.e. a system is slow in processing callbacks and we need some
> > > time until wait-node/nodes is/are released for reuse.
> > > 
> > > We need a next GP to move it forward, i.e. to repeat a try of attaching
> > > a dummy-node.
> > > 
> > Probably i should add a comment about it :)
> 
> Sounds good, and thank you for bearing with me!
> 
Thanks to you :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-01-04 11:18 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-28  8:00 [PATCH v3 0/7] Reduce synchronize_rcu() latency(V3) Uladzislau Rezki (Sony)
2023-11-28  8:00 ` [PATCH v3 1/7] rcu: Reduce synchronize_rcu() latency Uladzislau Rezki (Sony)
2023-11-28  8:00 ` [PATCH v3 2/7] rcu: Add a trace event for synchronize_rcu_normal() Uladzislau Rezki (Sony)
2023-11-28  8:00 ` [PATCH v3 3/7] doc: Add rcutree.rcu_normal_wake_from_gp to kernel-parameters.txt Uladzislau Rezki (Sony)
2023-12-20  1:17   ` Paul E. McKenney
2023-12-21 10:28     ` Uladzislau Rezki
2023-11-28  8:00 ` [PATCH v3 4/7] rcu: Improve handling of synchronize_rcu() users Uladzislau Rezki (Sony)
2023-12-20  1:37   ` Paul E. McKenney
2023-12-21 10:52     ` Uladzislau Rezki
2023-12-21 18:40       ` Paul E. McKenney
2023-12-22  9:27         ` Uladzislau Rezki
2023-12-22 18:58           ` Paul E. McKenney
2024-01-02 12:52             ` Uladzislau Rezki
2024-01-02 19:25               ` Paul E. McKenney
2024-01-03 13:16                 ` Uladzislau Rezki
2024-01-03 14:47                   ` Paul E. McKenney
2024-01-03 17:35                     ` Uladzislau Rezki
2024-01-03 17:56                       ` Paul E. McKenney
2024-01-03 19:02                         ` Uladzislau Rezki
2024-01-03 19:03                           ` Uladzislau Rezki
2024-01-03 19:33                             ` Paul E. McKenney
2024-01-04 11:17                               ` Uladzislau Rezki
2023-11-28  8:00 ` [PATCH v3 5/7] rcu: Support direct wake-up " Uladzislau Rezki (Sony)
2023-12-20  1:46   ` Paul E. McKenney
2023-12-21 11:22     ` Uladzislau Rezki
2023-11-28  8:00 ` [PATCH v3 6/7] rcu: Move sync related data to rcu_state structure Uladzislau Rezki (Sony)
2023-12-20  1:47   ` Paul E. McKenney
2023-12-21 10:56     ` Uladzislau Rezki
2023-11-28  8:00 ` [PATCH v3 7/7] rcu: Add CONFIG_RCU_SR_NORMAL_DEBUG_GP Uladzislau Rezki (Sony)
2023-12-20  1:14   ` Paul E. McKenney
2023-12-21 10:27     ` Uladzislau Rezki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox