Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Minchan Kim <minchan@kernel.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	John Dias <joaodias@google.com>
Subject: Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable
Date: Tue, 25 Jan 2022 13:06:17 -0800	[thread overview]
Message-ID: <YfBmSaMa826ZhFT4@google.com> (raw)
In-Reply-To: <Ye/Bgc1bH979cXwy@dhcp22.suse.cz>

On Tue, Jan 25, 2022 at 10:23:13AM +0100, Michal Hocko wrote:
> On Mon 24-01-22 14:22:03, Minchan Kim wrote:
> [...]
> >  CPU 0                               CPU 1
> > 
> >  lru_cache_disable                  lru_cache_disable
> >    ret = atomic_inc_return;(ret = 1)
> >                                      
> >                                     ret = atomic_inc_return;(ret = 2)
> >                                     
> >    lru_add_drain_all(true);         
> >                                     lru_add_drain_all(false)
> >                                     mutex_lock() is holding
> >    mutex_lock() is waiting
> > 
> >                                     IPI with !force_all_cpus
> >                                     ...
> >                                     ...
> >                                     IPI done but it skipped some CPUs
> >                
> >      ..
> >      ..
> >  
> > 
> > Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> > introduces race of lru_disable_count so some pages on cores
> > which didn't run the IPI could accept upcoming pages into per-cpu
> > cache.
> 
> Yes, that is certainly possible but the question is whether it really
> matters all that much. The race would require also another racer to be
> adding a page to an _empty_ pcp list at the same time.
> 
> pagevec_add_and_need_flush
>   1) pagevec_add # add to pcp list
>   2) lru_cache_disabled
>     atomic_read(lru_disable_count) = 0
>   # no flush but the page is on pcp
> 
> There is no strong memory ordering between 1 and 2 and that is why we
> need an IPI to enforce it in general IIRC.

Correct.

> 
> But lru_cache_disable is not a strong synchronization primitive. It aims
> at providing a best effort means to reduce false positives, right? IMHO

Nope. d479960e44f27, mm: disable LRU pagevec during the migration temporarily

Originally, it was designed to close the race fundamentally.

> it doesn't make much sense to aim for perfection because all users of
> this interface already have to live with temporary failures and pcp
> caches is not the only reason to fail - e.g. short lived page pins.

short lived pages are true but that doesn't mean we need to make the
allocation faster. As I mentioned, the IPI takes up to hundreds
milliseconds easily once CPUs are fully booked. If we reduce the
cost, we could spend the time more productive works. I am working
on making CMA more determinstic and this patch is one of parts.

> 
> That being said, I would rather live with a best effort and simpler
> implementation approach rather than aim for perfection in this case.
> The scheme is already quite complex and another lock in the mix doesn't

lru_add_drain_all already hides the whole complexity inside and
lru_cache_disable adds A simple synchroniztion to keep ordering
on top of it. That's natural SW stack and I don't see too complication
here.

> make it any easier to follow. If others believe that another lock makes

Disagree. lru_cache_disable is designed to guarantee closing the race
you are opening again so the other code in allocator since disabling
per-cpu cache doesn't need to consider the race at all. It's more
simple/deterministic and we could make other stuff based on it(e.g.,
zone->pcp). 

> the implementation more straightforward I will not object but I would go
> with the following.
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index ae8d56848602..c140c3743b9e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	int count = atomic_inc_return(&lru_disable_count);
> +
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -931,8 +932,28 @@ void lru_cache_disable(void)
>  	 * The atomic operation doesn't need to have stronger ordering
>  	 * requirements because that is enforeced by the scheduling
>  	 * guarantees.
> +	 * Please note that there is a potential for a race condition:
> +	 * CPU0				CPU1			CPU2
> +	 * pagevec_add_and_need_flush
> +	 *   pagevec_add # to the empty list
> +	 *   lru_cache_disabled
> +	 *     atomic_read # 0
> +	 *   				lru_cache_disable	lru_cache_disable
> +	 *				  atomic_inc_return (1)
> +	 *				  			  atomic_inc_return (2)
> +	 *				  __lru_add_drain_all(true)
> +	 *				  			  __lru_add_drain_all(false)
> +	 *				  			    mutex_lock
> +	 *				    mutex_lock
> +	 *				    			    # skip cpu0 (pagevec_add not visible yet)
> +	 *				    			    mutex_unlock
> +	 *				    			 # fail because of pcp(0) pin
> +	 *				    queue_work_on(0)
> +	 *
> +	 * but the scheme is a best effort and the above race quite unlikely
> +	 * to matter in real life.
>  	 */
> -	__lru_add_drain_all(true);
> +	__lru_add_drain_all(count == 1);
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> -- 
> Michal Hocko
> SUSE Labs

next prev parent reply	other threads:[~2022-01-25 21:06 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-30 19:36 [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable Minchan Kim
2022-01-06 18:14 ` Minchan Kim
2022-01-17 13:47 ` Michal Hocko
2022-01-19  0:12   ` Minchan Kim
2022-01-19  9:20     ` Michal Hocko
2022-01-20  4:25       ` Minchan Kim
2022-01-20  8:24         ` Michal Hocko
2022-01-20 21:07           ` Minchan Kim
2022-01-21  9:59             ` Michal Hocko
2022-01-21 21:56               ` Minchan Kim
2022-01-24  9:57                 ` Michal Hocko
2022-01-24 22:22                   ` Minchan Kim
2022-01-25  9:23                     ` Michal Hocko
2022-01-25 21:06                       ` Minchan Kim [this message]
2022-01-26 12:09                         ` Michal Hocko
2022-01-20  8:42     ` David Hildenbrand
2022-01-20 21:22       ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YfBmSaMa826ZhFT4@google.com \
    --to=minchan@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=joaodias@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.