All of lore.kernel.org
 help / color / mirror / Atom feed
From: Uladzislau Rezki <urezki@gmail.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>, RCU <rcu@vger.kernel.org>,
	Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Daniel Axtens <dja@axtens.net>,
	Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <neeraju@codeaurora.org>,
	Joel Fernandes <joel@joelfernandes.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Theodore Y . Ts'o" <tytso@mit.edu>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>,
	Zhang Qiang <qiang.zhang@windriver.com>
Subject: Re: [PATCH v2 1/1] kvfree_rcu: Release a page cache under memory pressure
Date: Thu, 18 Mar 2021 18:47:46 +0100	[thread overview]
Message-ID: <20210318174746.GA10488@pc638.lan> (raw)
In-Reply-To: <20210316210125.GY2696@paulmck-ThinkPad-P72>

On Tue, Mar 16, 2021 at 02:01:25PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 16, 2021 at 09:42:07PM +0100, Uladzislau Rezki wrote:
> > > On Wed, Mar 10, 2021 at 09:07:57PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > From: Zhang Qiang <qiang.zhang@windriver.com>
> > > > 
> > > > Add a drain_page_cache() function to drain a per-cpu page cache.
> > > > The reason behind of it is a system can run into a low memory
> > > > condition, in that case a page shrinker can ask for its users
> > > > to free their caches in order to get extra memory available for
> > > > other needs in a system.
> > > > 
> > > > When a system hits such condition, a page cache is drained for
> > > > all CPUs in a system. Apart of that a page cache work is delayed
> > > > with 5 seconds interval until a memory pressure disappears.
> > > 
> > > Does this capture it?
> > > 
> > It would be good to have kind of clear interface saying that:
> > 
> > - low memory condition starts;
> > - it is over, watermarks were fixed.
> > 
> > but i do not see it. Therefore 5 seconds back-off has been chosen
> > to make a cache refilling to be less aggressive. Suppose 5 seconds
> > is not enough, in that case the work will attempt to allocate some
> > pages using less permissive parameters. What means that if we are
> > still in a low memory condition a refilling will probably fail and
> > next job will be invoked in 5 seconds one more time.
> 
> I would like such an interface as well, but from what I hear it is
> easier to ask for than to provide.  :-/
> 
> > > ------------------------------------------------------------------------
> > > 
> > > Add a drain_page_cache() function that drains the specified per-cpu
> > > page cache.  This function is invoked on each CPU when the system
> > > enters a low-memory state, that is, when the shrinker invokes
> > > kfree_rcu_shrink_scan().  Thus, when the system is low on memory,
> > > kvfree_rcu() starts taking its slow paths.
> > > 
> > > In addition, the first subsequent attempt to refill the caches is
> > > delayed for five seconds.
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > A few questions below.
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Signed-off-by: Zqiang <qiang.zhang@windriver.com>
> > > > ---
> > > >  kernel/rcu/tree.c | 59 ++++++++++++++++++++++++++++++++++++++++-------
> > > >  1 file changed, 51 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 2c9cf4df942c..46b8a98ca077 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
> > > >  	bool initialized;
> > > >  	int count;
> > > >  
> > > > -	struct work_struct page_cache_work;
> > > > +	struct delayed_work page_cache_work;
> > > >  	atomic_t work_in_progress;
> > > >  	struct hrtimer hrtimer;
> > > >  
> > > > @@ -3175,6 +3175,17 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
> > > >  	.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
> > > >  };
> > > >  
> > > > +// A page shrinker can ask for freeing extra pages
> > > > +// to get them available for other needs in a system.
> > > > +// Usually it happens under low memory condition, in
> > > > +// that case hold on a bit with page cache filling.
> > > > +static unsigned long backoff_page_cache_fill;
> > > > +
> > > > +// 5 seconds delay. That is long enough to reduce
> > > > +// an interfering and racing with a shrinker where
> > > > +// the cache is drained.
> > > > +#define PAGE_CACHE_FILL_DELAY (5 * HZ)
> > > > +
> > > >  static __always_inline void
> > > >  debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
> > > >  {
> > > > @@ -3229,6 +3240,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
> > > >  
> > > >  }
> > > >  
> > > > +static int
> > > > +drain_page_cache(struct kfree_rcu_cpu *krcp)
> > > > +{
> > > > +	unsigned long flags;
> > > > +	struct llist_node *page_list, *pos, *n;
> > > > +	int freed = 0;
> > > > +
> > > > +	raw_spin_lock_irqsave(&krcp->lock, flags);
> > > > +	page_list = llist_del_all(&krcp->bkvcache);
> > > > +	krcp->nr_bkv_objs = 0;
> > > > +	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > > > +
> > > > +	llist_for_each_safe(pos, n, page_list) {
> > > > +		free_page((unsigned long)pos);
> > > > +		freed++;
> > > > +	}
> > > > +
> > > > +	return freed;
> > > > +}
> > > > +
> > > >  /*
> > > >   * This function is invoked in workqueue context after a grace period.
> > > >   * It frees all the objects queued on ->bhead_free or ->head_free.
> > > > @@ -3419,7 +3450,7 @@ schedule_page_work_fn(struct hrtimer *t)
> > > >  	struct kfree_rcu_cpu *krcp =
> > > >  		container_of(t, struct kfree_rcu_cpu, hrtimer);
> > > >  
> > > > -	queue_work(system_highpri_wq, &krcp->page_cache_work);
> > > > +	queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
> > > >  	return HRTIMER_NORESTART;
> > > >  }
> > > >  
> > > > @@ -3428,7 +3459,7 @@ static void fill_page_cache_func(struct work_struct *work)
> > > >  	struct kvfree_rcu_bulk_data *bnode;
> > > >  	struct kfree_rcu_cpu *krcp =
> > > >  		container_of(work, struct kfree_rcu_cpu,
> > > > -			page_cache_work);
> > > > +			page_cache_work.work);
> > > >  	unsigned long flags;
> > > >  	bool pushed;
> > > >  	int i;
> > > > @@ -3457,10 +3488,14 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > > >  {
> > > >  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> > > >  			!atomic_xchg(&krcp->work_in_progress, 1)) {
> > > > -		hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> > > > -			HRTIMER_MODE_REL);
> > > > -		krcp->hrtimer.function = schedule_page_work_fn;
> > > > -		hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> > > > +		if (xchg(&backoff_page_cache_fill, 0UL)) {
> > > 
> > > How often can run_page_cache_worker() be invoked?  I am a bit
> > > concerned about the possibility of heavy memory contention on the
> > > backoff_page_cache_fill variable on large systems.  Unless there
> > > is something that sharply bounds the frequency of calls to
> > > run_page_cache_worker(), something like this would be more scalable:
> > > 
> > > 		if (backoff_page_cache_fill &&
> > > 		    xchg(&backoff_page_cache_fill, 0UL)) {
> > > 
> > It is called per-cpu. Once the cache is empty it will be called. Next time
> > will be after the worker completes filling the cache and krcp is run out of
> > cache again. I do not consider it as high contention on the backoff_page_cache_fill
> > variable. On my 64 CPUs system the run_page_cache_worker() itself does not
> > consume much CPU cycles during the test:
> > 
> > Samples: 2K of event 'cycles:k', Event count (approx.): 1372274198                                                                                                                       
> > Overhead  Command          Shared Object     Symbol                                                                                                                                      
> >   27.45%  kworker/0:2-eve  [kernel.vmlinux]  [k] kmem_cache_free_bulk                                                                                                                    
> >   14.56%  vmalloc_test/0   [kernel.vmlinux]  [k] kmem_cache_alloc_trace                                                                                                                  
> >   11.34%  vmalloc_test/0   [kernel.vmlinux]  [k] kvfree_call_rcu                                                                                                                         
> >    7.61%  vmalloc_test/0   [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore                                                                                                             
> >    7.60%  vmalloc_test/0   [kernel.vmlinux]  [k] allocate_slab                                                                                                                           
> >    5.38%  vmalloc_test/0   [kernel.vmlinux]  [k] check_preemption_disabled                                                                                                               
> >    3.12%  vmalloc_test/0   [kernel.vmlinux]  [k] _raw_spin_lock                                                                                                                          
> >    2.85%  vmalloc_test/0   [kernel.vmlinux]  [k] preempt_count_add                                                                                                                       
> >    2.64%  vmalloc_test/0   [kernel.vmlinux]  [k] __list_del_entry_valid                                                                                                                  
> >    2.53%  vmalloc_test/0   [kernel.vmlinux]  [k] preempt_count_sub                                                                                                                       
> >    1.81%  vmalloc_test/0   [kernel.vmlinux]  [k] native_write_msr                                                                                                                        
> >    1.05%  kworker/0:2-eve  [kernel.vmlinux]  [k] __slab_free                                                                                                                             
> >    0.96%  vmalloc_test/0   [kernel.vmlinux]  [k] asm_sysvec_apic_timer_interrupt                                                                                                         
> >    0.96%  vmalloc_test/0   [kernel.vmlinux]  [k] setup_object_debug.isra.69                                                                                                              
> >    0.76%  kworker/0:2-eve  [kernel.vmlinux]  [k] free_pcppages_bulk                                                                                                                      
> >    0.72%  kworker/0:2-eve  [kernel.vmlinux]  [k] put_cpu_partial                                                                                                                         
> >    0.72%  vmalloc_test/0   [test_vmalloc]    [k] kvfree_rcu_2_arg_slab_test                                                                                                              
> >    0.52%  kworker/0:2-eve  [kernel.vmlinux]  [k] kfree_rcu_work                                                                                                                          
> >    0.52%  vmalloc_test/0   [kernel.vmlinux]  [k] get_page_from_freelist                                                                                                                  
> >    0.52%  vmalloc_test/0   [kernel.vmlinux]  [k] run_page_cache_worker
> > 
> > <run_page_cache_worker>
> >        │    arch_atomic_xchg():
> >        │      mov   $0x1,%eax
> >        │    run_page_cache_worker():
> >        │      push  %rbx
> >        │    arch_atomic_xchg():
> >        │      xchg  %eax,0x188(%rdi)
> >        │    run_page_cache_worker():
> > 100.00 │      test  %eax,%eax
> > <run_page_cache_worker>
> > 
> > <snip>
> >     if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> >         !atomic_xchg(&krcp->work_in_progress, 1)) { <-- here all cycles of run_page_cache_worker()
> > <snip>
> 
> Understood, and the concern isn't so much lots of CPU time being burned
> by the function, but rather the behavior when timing lines up badly.
> 
> > > It looks to me like all the CPUs could invoke run_page_cache_worker()
> > > at the same time.  Or am I missing something that throttles calls to
> > > run_page_cache_worker(), even on systems with hundreds of CPUs?
> > >
> > It is per-cpu, thus is serialized.
> 
> The cache is per-CPU, agreed, but backoff_page_cache_fill is global, right?
> 
Correct. But it should be fixed :)

> > > Also, if I am reading the code correctly, the unlucky first CPU to
> > > attempt to refill cache after a shrinker invocation would be delayed
> > > five seconds (thus invoking the slow path during that time), but other
> > > CPUs would continue unimpeded.  Is this the intent?
> > > 
> > A backoff_page_cache_fill is global and shared among all CPUs. So, if one
> > changes it following a slow path whereas all the rest will refill their
> > caches anyway following a fast path.
> > 
> > That should be fixed making it per-cpu also. A shrinker should mark each
> > CPU to back-off refilling.
> 
> That would be much better!
> 
> > > If I understand correctly, the point is to avoid the situation where
> > > memory needed elsewhere is drained and then immediately refilled.
> > > But the code will do the immediate refill when the rest of the CPUs show
> > > up, correct?
> > >
> > Correct. We do not want to request pages for some period of time, because
> > they might be needed for other needs and other users in a system. We have
> > fall-backs, so there is no a high demand in it for our case.
> > 
> > > 
> > > Might it be better to put a low cap on the per-CPU caches for some
> > > period of time after the shrinker runs?  Maybe allow at most one page
> > > to be cached for the five seconds following?
> > > 
> > That we can do!
> > 
> > > > +			queue_delayed_work(system_wq,
> > > > +				&krcp->page_cache_work, PAGE_CACHE_FILL_DELAY);
> > > > +		} else {
> > > > +			hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > > > +			krcp->hrtimer.function = schedule_page_work_fn;
> > > > +			hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> > > > +		}
> > > >  	}
> > > >  }
> > > >  
> > > > @@ -3612,14 +3647,20 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > >  {
> > > >  	int cpu;
> > > >  	unsigned long count = 0;
> > > > +	unsigned long flags;
> > > >  
> > > >  	/* Snapshot count of all CPUs */
> > > >  	for_each_possible_cpu(cpu) {
> > > >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > > >  
> > > >  		count += READ_ONCE(krcp->count);
> > > > +
> > > > +		raw_spin_lock_irqsave(&krcp->lock, flags);
> > > > +		count += krcp->nr_bkv_objs;
> > > > +		raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > > 
> > > Not a big deal given that this should not be invoked often, but couldn't
> > > the read from ->nr_bkv_objs be READ_ONCE() without the lock?  (This would
> > > require that updates use WRITE_ONCE() as well.)
> > > 
> > I was thinking about it. Will re-spin and rework :)
> 
> Sounds good, looking forward to seeing what you guys come up with!
> 
Working on it!

--
Vlad Rezki

      reply	other threads:[~2021-03-18 17:48 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-10 20:07 [PATCH v2 1/1] kvfree_rcu: Release a page cache under memory pressure Uladzislau Rezki (Sony)
2021-03-16 16:31 ` Paul E. McKenney
2021-03-16 20:42   ` Uladzislau Rezki
2021-03-16 21:01     ` Paul E. McKenney
2021-03-18 17:47       ` Uladzislau Rezki [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210318174746.GA10488@pc638.lan \
    --to=urezki@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=bigeasy@linutronix.de \
    --cc=dja@axtens.net \
    --cc=frederic@kernel.org \
    --cc=joel@joelfernandes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=neeraju@codeaurora.org \
    --cc=oleksiy.avramchenko@sonymobile.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=qiang.zhang@windriver.com \
    --cc=rcu@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.