Re: [PATCH v3 1/1] rcu: Reduce synchronize_rcu() waiting time

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Uladzislau Rezki <urezki@gmail.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	RCU <rcu@vger.kernel.org>,
	Neeraj upadhyay <Neeraj.Upadhyay@amd.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>,
	Frederic Weisbecker <frederic@kernel.org>
Subject: Re: [PATCH v3 1/1] rcu: Reduce synchronize_rcu() waiting time
Date: Tue, 17 Oct 2023 16:06:29 +0200	[thread overview]
Message-ID: <ZS6U5SgvGcmdE_DA@pc636> (raw)
In-Reply-To: <CAEXW_YRfuXqnBFN=DpOLio74j8fX3eEDSFCH8LXyavuHDdYysA@mail.gmail.com>

Hello, Joel!

> Hello, Vlad,
> 
> Looks like a nice patch, I am still looking at this more and just
> started, but a few small comments:
> 
Thank you for looking at it :) And thank you for the comments.


> On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony)
> <urezki@gmail.com> wrote:
> >
> > A call to a synchronize_rcu() can be optimized from time point of
> > view. Different workloads can be affected by this especially the
> > ones which use this API in its time critical sections.
> >
> > For example if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> > callback can be delayed and such delay depends on where in a nocb
> > list it is located.
> >
> > 1. On our Android devices i can easily trigger the scenario when
> > it is a last in the list out of ~3600 callbacks:
> >
> > <snip>
> >   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> >   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> >   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups. For example "backgrond" group which binds tasks
> > only to little CPUs or "foreground" which makes use of all CPUs.
> > Tasks can be migrated between groups by a request if an acceleration
> > is needed.
> >
> > See below an example how "surfaceflinger" task gets migrated.
> > Initially it is located in the "system-background" cgroup which
> > allows to run only on little cores. In order to speed it up it
> > can be temporary moved into "foreground" cgroup which allows
> > to use big/all CPUs:
> >
> [...]
> > 3. To address this drawback, maintain a separate track that consists
> > of synchronize_rcu() callers only. The GP-kthread, that drivers a GP
> > either wake-ups a worker to drain all list or directly wakes-up end
> > user if it is one in the drain list.
> 
> I wonder if there is a cut-off N, where waking up N ("a few") inline
> instead of just 1, yields similar results. For small values of N, that
> keeps the total number of wakeups lower than pushing off to a kworker.
> For instance, if 2 users, then you get 3 wakeups instead of just 2 (1
> for the kworker and another 2 for the synchronize). But if you had a
> cut off like N=5, then 2 users results in just 2 wakeups.
> I don't feel too strongly about it, but for small values of N, I am
> interested in a slightly better optimization if we can squeeze it.
> 
This makes sense to me. We can add a threshold to wake-up several users.
Like you mentioned. We can do 2 wake-ups instead of 3.

> [...]
> > + * There are three lists for handling synchronize_rcu() users.
> > + * A first list corresponds to new coming users, second for users
> > + * which wait for a grace period and third is for which a grace
> > + * period is passed.
> > + */
> > +static struct sr_normal_state {
> > +       struct llist_head curr; /* request a GP users. */
> > +       struct llist_head wait; /* wait for GP users. */
> > +       struct llist_head done; /* ready for GP users. */
> > +       struct llist_node *curr_tail;
> > +       struct llist_node *wait_tail;
> 
> Just for clarification, the only reason you need 'tail' is because you
> can use llist_add_batch() to quickly splice the list, instead of
> finding the tail each time (expensive for a large list), correct? It
> would be good to mention that in a comment here.
> 
Right. I do not want to scan the list. I will comment it.

> > +       atomic_t active;
> > +} sr;
> > +
> > +/* Disable it by default. */
> > +static int rcu_normal_wake_from_gp = 0;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > +
> > +static void rcu_sr_normal_complete(struct llist_node *node)
> > +{
> > +       struct rcu_synchronize *rs = container_of(
> > +               (struct rcu_head *) node, struct rcu_synchronize, head);
> > +       unsigned long oldstate = (unsigned long) rs->head.func;
> > +
> > +       if (!poll_state_synchronize_rcu(oldstate))
> > +               WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> > +                       rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
> 
> nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate)), ...) and get
> rid of if() ?
>
Initially i had it written as you proposed i reworked it because i
wanted to see the delta. I do not have a strong opinion, so i can
fix it as you proposed.

> 
> > +
> > +       /* Finally. */
> > +       complete(&rs->completion);
> > +}
> > +
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +       struct llist_node *done, *rcu, *next;
> > +
> > +       done = llist_del_all(&sr.done);
> > +       if (!done)
> > +               return;
> > +
> > +       llist_for_each_safe(rcu, next, done)
> > +               rcu_sr_normal_complete(rcu);
> > +}
> [...]
> > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > +{
> > +       atomic_inc(&sr.active);
> > +       if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > +               /* Set the tail. Only first and one user can do that. */
> > +               WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > +       atomic_dec(&sr.active);
> 
> Here there is no memory ordering provided by the atomic ops. Is that really Ok?
> 
This needs to be reworked since there is no ordering guaranteed. I think
there is a version of "atomic_inc_something" that guarantees it?

> And same question for other usages of .active.
> 
> Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
> "RMW operations that have no return value are unordered;"
> --
> Note to self to ping my Android friends as well about this improvement
> :-). Especially the -mm Android folks are interested in app startup
> time.
> 
That is good. Thank you :)

--
Uladzislau Rezki

next prev parent reply	other threads:[~2023-10-17 14:06 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-16 17:30 [PATCH v3 1/1] rcu: Reduce synchronize_rcu() waiting time Uladzislau Rezki (Sony)
2023-10-17  1:28 ` Joel Fernandes
2023-10-17 14:06   ` Uladzislau Rezki [this message]
2023-10-18 14:32     ` Joel Fernandes
2023-10-18 17:36       ` Uladzislau Rezki
     [not found] ` <20231017103342.1879-1-hdanton@sina.com>
2023-10-18 11:20   ` Uladzislau Rezki
     [not found]   ` <20231019114432.1995-1-hdanton@sina.com>
2023-10-19 15:44     ` Uladzislau Rezki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZS6U5SgvGcmdE_DA@pc636 \
    --to=urezki@gmail.com \
    --cc=Neeraj.Upadhyay@amd.com \
    --cc=boqun.feng@gmail.com \
    --cc=frederic@kernel.org \
    --cc=joel@joelfernandes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleksiy.avramchenko@sony.com \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.