Re: [PATCH v2] rcu/kfree: Do not request RCU when not needed

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Uladzislau Rezki <urezki@gmail.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	linux-kernel@vger.kernel.org, paulmck@kernel.org,
	rcu@vger.kernel.org
Subject: Re: [PATCH v2] rcu/kfree: Do not request RCU when not needed
Date: Thu, 10 Nov 2022 15:01:30 +0100	[thread overview]
Message-ID: <Y20EOinwcLSZHmXg@pc638.lan> (raw)
In-Reply-To: <CAEXW_YSq89xzgyQ9Tdt1tCqz8VAfzb7kSXVZmnxDuJ65U0UZ3w@mail.gmail.com>

> Hi,
> 
> On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> > > On ChromeOS, using this with the increased timeout, we see that we
> > almost always
> > > never need to initiate a new grace period. Testing also shows this frees
> > large
> > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > > v1->v2: Same logic but use polled grace periods instead of sampling
> > gp_seq.
> > >
> > >  kernel/rcu/tree.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 591187b6352e..ed41243f7a49 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > >
> > >  /**
> > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > period
> > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > work.
> > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > grace period
> > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > grace period
> > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > >       raw_spinlock_t lock;
> > >       struct delayed_work monitor_work;
> > > +     unsigned long gp_snap;
> > >       bool initialized;
> > >       int count;
> > >
> > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > *krcp)
> > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > delay);
> > >               return;
> > >       }
> > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > >  }
> > >
> > How do you guarantee a full grace period for objects which proceed
> > to be placed into an input stream that is not yet detached?
> 
> 
> Just replying from phone as I’m OOO today.
> 
> Hmm, so you’re saying that objects can be queued after the delayed work has
> been queued, but processed when the delayed work is run? I’m looking at
> this code after few years so I may have missed something.
> 
> That’s a good point and I think I missed that. I think your version did too
> but I’ll have to double check.
> 
> The fix then is to sample the clock for the latest object queued, not for
> when the delayed work is queued.
> 
The patch i sent gurantee it. Just in case see v2:

From 7ff4038d5dac8d2044b240adeee630ad7944ab8d Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Wed, 2 Nov 2022 19:26:27 +0100
Subject: [PATCH] rcu: kvfree_rcu: Reduce a memory footptint by using polling
 APIs

Total time taken by all kfree'ers: 6564718459 ns, loops: 10000, batches: 1110, memory footprint: 5057MB
Total time taken by all kfree'ers: 8431051895 ns, loops: 10000, batches: 1109, memory footprint: 2749MB
Total time taken by all kfree'ers: 9477830789 ns, loops: 10000, batches: 1158, memory footprint: 2934MB
Total time taken by all kfree'ers: 9950211144 ns, loops: 10000, batches: 981, memory footprint: 2704MB

with a patch:

Total time taken by all kfree'ers: 7712110118 ns, loops: 10000, batches: 1660, memory footprint: 91MB
Total time taken by all kfree'ers: 7002403664 ns, loops: 10000, batches: 1482, memory footprint: 86MB
Total time taken by all kfree'ers: 7842282319 ns, loops: 10000, batches: 1738, memory footprint: 86MB
Total time taken by all kfree'ers: 7230161977 ns, loops: 10000, batches: 1542, memory footprint: 72MB

Tested with NOCB option, all offloading CPUs:

kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
  --kconfig CONFIG_NR_CPUS=64 \
  --kconfig CONFIG_RCU_NOCB_CPU=y \
  --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
  --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
  rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make

According to data there is a big gain in memory footprint with a patch.
It is because of call_rcu() and call_rcu_flush() take more effort and
time to queue a callback and then wait for a gp.

With polling API:
  a) we do not need to queue any callback;
  b) we might not even need wait for a GP completion.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 76973d716921..2be4f80867f2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2919,18 +2919,20 @@ struct kvfree_rcu_bulk_data {
 	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
 
 /**
+ * @rcu_work: A work to reclaim a memory after a grace period
  * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
- * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
  * @head_free: List of kfree_rcu() objects waiting for a grace period
  * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
  * @krcp: Pointer to @kfree_rcu_cpu structure
+ * @gp_snap: A snapshot of current grace period
  */
 
 struct kfree_rcu_cpu_work {
-	struct rcu_work rcu_work;
+	struct work_struct rcu_work;
 	struct rcu_head *head_free;
 	struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu *krcp;
+	unsigned long gp_snap;
 };
 
 /**
@@ -3064,10 +3066,11 @@ static void kfree_rcu_work(struct work_struct *work)
 	struct rcu_head *head, *next;
 	struct kfree_rcu_cpu *krcp;
 	struct kfree_rcu_cpu_work *krwp;
+	unsigned long this_krwp_gp_snap;
 	int i, j;
 
-	krwp = container_of(to_rcu_work(work),
-			    struct kfree_rcu_cpu_work, rcu_work);
+	krwp = container_of(work,
+		struct kfree_rcu_cpu_work, rcu_work);
 	krcp = krwp->krcp;
 
 	raw_spin_lock_irqsave(&krcp->lock, flags);
@@ -3080,8 +3083,15 @@ static void kfree_rcu_work(struct work_struct *work)
 	// Channel 3.
 	head = krwp->head_free;
 	krwp->head_free = NULL;
+
+	// Get the latest saved state of grace period
+	// associated with this "krwp" and objects there.
+	this_krwp_gp_snap = krwp->gp_snap;
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
+	// Check the state.
+	cond_synchronize_rcu(this_krwp_gp_snap);
+
 	// Handle the first two channels.
 	for (i = 0; i < FREE_N_CHANNELS; i++) {
 		for (; bkvhead[i]; bkvhead[i] = bnext) {
@@ -3212,12 +3222,22 @@ static void kfree_rcu_monitor(struct work_struct *work)
 
 			WRITE_ONCE(krcp->count, 0);
 
+			// 1. Take a snapshot for this krwp. Please note no
+			// more any objects can be added to channels which
+			// have already been attached.
+			//
+			// 2. Since a "krwp" has several free channels a GP
+			// status of latest attached one is recorded, i.e.
+			// it allows us to maintain a GP status for last in
+			// objects among all channels.
+			krwp->gp_snap = get_state_synchronize_rcu();
+
 			// One work is per one batch, so there are three
 			// "free channels", the batch can handle. It can
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			queue_work(system_wq, &krwp->rcu_work);
 		}
 	}
 
@@ -4808,7 +4828,7 @@ static void __init kfree_rcu_batch_init(void)
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
-			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
+			INIT_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
 		}
 
> > > @@ -3217,7 +3220,10 @@ static void kfree_rcu_monitor(struct work_struct
> > *work)
> > >                       // be that the work is in the pending state when
> > >                       // channels have been detached following by each
> > >                       // other.
> > > -                     queue_rcu_work(system_wq, &krwp->rcu_work);
> > > +                     if (poll_state_synchronize_rcu(krcp->gp_snap))
> > > +                             queue_work(system_wq, &krwp->rcu_work.work
> > );
> > > +                     else
> > > +                             queue_rcu_work(system_wq, &krwp->rcu_work);
> > >               }
> > >
> > Why do you want to queue a work over RCU-core?
> >
> > 1.
> > call_rcu()
> >    -> queue_work();
> >       -> do reclaim
> >
> > if it can be improved and simplified as:
> >
> > 2.
> > queue_work();
> >     -> cond_synchronize_rcu(), do reclaim
> 
> 
> The second one blocks, the first doesn’t. So I’m not sure how that’s an
> improvement? I think we don’t want to block in kworker due to possible
> scheduling issues and hurting other kwork during critical operations, if
> possible.
> 
Does the first wait for a full grace period so our kworker is not even run?

--
Uladzislau Rezki

next prev parent reply	other threads:[~2022-11-10 14:01 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-09  2:47 [PATCH v2] rcu/kfree: Do not request RCU when not needed Joel Fernandes (Google)
2022-11-10 13:05 ` Uladzislau Rezki
     [not found]   ` <CAEXW_YSq89xzgyQ9Tdt1tCqz8VAfzb7kSXVZmnxDuJ65U0UZ3w@mail.gmail.com>
2022-11-10 14:01     ` Uladzislau Rezki [this message]
2022-11-11  1:56       ` Joel Fernandes
2022-11-14 12:20         ` Uladzislau Rezki
2022-11-14 16:17           ` Paul E. McKenney
2022-11-14 20:54             ` Joel Fernandes
2022-11-14 21:26               ` Paul E. McKenney
2022-11-15 12:05             ` Uladzislau Rezki
2022-11-14 20:49           ` Joel Fernandes
2022-11-15 13:07             ` Uladzislau Rezki
2022-11-16 19:19               ` Uladzislau Rezki
2022-11-16 22:05                 ` Joel Fernandes
2022-11-17 12:58                   ` Uladzislau Rezki
2022-11-17 13:06                     ` Joel Fernandes
2022-11-17 13:11                       ` Uladzislau Rezki
2022-11-17 13:23                         ` Joel Fernandes
2022-11-17 13:43                           ` Uladzislau Rezki
  -- strict thread matches above, loose matches on Subject: below --
2022-11-04 14:21 Joel Fernandes (Google)

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:76973d71692 dfblob:2be4f80867f )
 OR (
bs:"rcu: kvfree_rcu: Reduce a memory footptint by using polling" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y20EOinwcLSZHmXg@pc638.lan \
    --to=urezki@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.